Submitted by:
| # | Name | Id | Email |
|---|---|---|---|
| Student 1 | Avia Avraham | 206417701 | avia.avraham@campus.technion.ac.il |
| Student 2 | Daniel Perez | 212056741 | daniel.pe@campus.technion.ac.il |
Introduction¶
In this assignment we'll learn to generate text with a deep multilayer RNN network based on GRU cells. Then we'll focus our attention on image generation using a variational autoencoder. We will then shift our focus to sentiment analysis: First by training a transformer-style encoder, and then by fine-tuning a pre-trained model from Hugging Face.
General Guidelines¶
- Please read the getting started page on the course website. It explains how to set up, run, and submit the assignment.
- This assignment requires running on GPU-enabled hardware. Please read the course servers usage guide. It explains how to use and run your code on the course servers to benefit from training with GPUs.
- The text and code cells in these notebooks are intended to guide you through the assignment and help you verify your solutions. The notebooks do not need to be edited, with two exceptions: fill in your name(s) in the cell above before submission, and implement a small code block in Part 4. Please do not remove sections or change the order of any cells.
- All your code (and even answers to questions) should be written in the files within the python package corresponding to the assignment number (hw1, hw2, etc.). You can of course use any editor or IDE to work on these files.
$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} $$
Part 1: Sequence Models¶
In this part we will learn about working with text sequences using recurrent neural networks. We'll go from a raw text file all the way to a fully trained GRU-RNN model and generate works of art!
import unittest
import os
import sys
import pathlib
import urllib
import shutil
import re
import numpy as np
import torch
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2
test = unittest.TestCase()
plt.rcParams.update({'font.size': 12})
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
Using device: cuda
Text generation with a char-level RNN¶
Obtaining the corpus¶
Let's begin by downloading a corpus containing all the works of William Shakespeare. Since he was very prolific, this corpus is fairly large and will provide us with enough data for obtaining impressive results.
CORPUS_URL = 'https://github.com/cedricdeboom/character-level-rnn-datasets/raw/master/datasets/shakespeare.txt'
DATA_DIR = pathlib.Path.home().joinpath('.pytorch-datasets')
def download_corpus(out_path=DATA_DIR, url=CORPUS_URL, force=False):
pathlib.Path(out_path).mkdir(exist_ok=True)
out_filename = os.path.join(out_path, os.path.basename(url))
if os.path.isfile(out_filename) and not force:
print(f'Corpus file {out_filename} exists, skipping download.')
else:
print(f'Downloading {url}...')
with urllib.request.urlopen(url) as response, open(out_filename, 'wb') as out_file:
shutil.copyfileobj(response, out_file)
print(f'Saved to {out_filename}.')
return out_filename
corpus_path = download_corpus()
Corpus file /home/avia.avraham/.pytorch-datasets/shakespeare.txt exists, skipping download.
Load the text into memory and print a snippet:
with open(corpus_path, 'r', encoding='utf-8') as f:
corpus = f.read()
print(f'Corpus length: {len(corpus)} chars')
print(corpus[7:1234])
Corpus length: 6347703 chars
ALLS WELL THAT ENDS WELL
by William Shakespeare
Dramatis Personae
KING OF FRANCE
THE DUKE OF FLORENCE
BERTRAM, Count of Rousillon
LAFEU, an old lord
PAROLLES, a follower of Bertram
TWO FRENCH LORDS, serving with Bertram
STEWARD, Servant to the Countess of Rousillon
LAVACHE, a clown and Servant to the Countess of Rousillon
A PAGE, Servant to the Countess of Rousillon
COUNTESS OF ROUSILLON, mother to Bertram
HELENA, a gentlewoman protected by the Countess
A WIDOW OF FLORENCE.
DIANA, daughter to the Widow
VIOLENTA, neighbour and friend to the Widow
MARIANA, neighbour and friend to the Widow
Lords, Officers, Soldiers, etc., French and Florentine
SCENE:
Rousillon; Paris; Florence; Marseilles
ACT I. SCENE 1.
Rousillon. The COUNT'S palace
Enter BERTRAM, the COUNTESS OF ROUSILLON, HELENA, and LAFEU, all in black
COUNTESS. In delivering my son from me, I bury a second husband.
BERTRAM. And I in going, madam, weep o'er my father's death anew;
but I must attend his Majesty's command, to whom I am now in
ward, evermore in subjection.
LAFEU. You shall find of the King a husband, madam; you, sir, a
father. He that so generally is at all times good must of
Data Preprocessing¶
The first thing we'll need is to map from each unique character in the corpus to an index that will represent it in our learning process.
TODO: Implement the char_maps() function in the hw3/charnn.py module.
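For reference, here is a minimal sketch of what such a mapping could look like (the actual implementation belongs in hw3/charnn.py; sorting the unique chars is one way to make the mapping deterministic):

```python
def char_maps(text: str):
    # Sort the unique chars so that the index assignment is deterministic.
    unique_chars = sorted(set(text))
    char_to_idx = {c: i for i, c in enumerate(unique_chars)}
    idx_to_char = {i: c for i, c in enumerate(unique_chars)}
    return char_to_idx, idx_to_char
```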
import hw3.charnn as charnn
char_to_idx, idx_to_char = charnn.char_maps(corpus)
print(char_to_idx)
test.assertEqual(len(char_to_idx), len(idx_to_char))
test.assertSequenceEqual(list(char_to_idx.keys()), list(idx_to_char.values()))
test.assertSequenceEqual(list(char_to_idx.values()), list(idx_to_char.keys()))
{'\n': 0, ' ': 1, '!': 2, '"': 3, '$': 4, '&': 5, "'": 6, '(': 7, ')': 8, ',': 9, '-': 10, '.': 11, '0': 12, '1': 13, '2': 14, '3': 15, '4': 16, '5': 17, '6': 18, '7': 19, '8': 20, '9': 21, ':': 22, ';': 23, '<': 24, '?': 25, 'A': 26, 'B': 27, 'C': 28, 'D': 29, 'E': 30, 'F': 31, 'G': 32, 'H': 33, 'I': 34, 'J': 35, 'K': 36, 'L': 37, 'M': 38, 'N': 39, 'O': 40, 'P': 41, 'Q': 42, 'R': 43, 'S': 44, 'T': 45, 'U': 46, 'V': 47, 'W': 48, 'X': 49, 'Y': 50, 'Z': 51, '[': 52, ']': 53, '_': 54, 'a': 55, 'b': 56, 'c': 57, 'd': 58, 'e': 59, 'f': 60, 'g': 61, 'h': 62, 'i': 63, 'j': 64, 'k': 65, 'l': 66, 'm': 67, 'n': 68, 'o': 69, 'p': 70, 'q': 71, 'r': 72, 's': 73, 't': 74, 'u': 75, 'v': 76, 'w': 77, 'x': 78, 'y': 79, 'z': 80, '}': 81, '\ufeff': 82}
It seems the corpus contains some strange characters that are very rare and probably due to mistakes. Since each of these adds a dimension to the tensors that will later represent our chars, it's best to remove them.
TODO: Implement the remove_chars() function in the hw3/charnn.py module.
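A possible sketch of this function, assuming (based on the cell below) that it should return both the cleaned corpus and the number of chars removed:

```python
def remove_chars(text: str, chars_to_remove):
    clean_text = text
    for char in chars_to_remove:
        clean_text = clean_text.replace(char, '')
    # The number of removed chars is just the difference in lengths.
    n_removed = len(text) - len(clean_text)
    return clean_text, n_removed
```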
corpus, n_removed = charnn.remove_chars(corpus, ['}','$','_','<','\ufeff'])
print(f'Removed {n_removed} chars')
# After removing the chars, re-create the mappings
char_to_idx, idx_to_char = charnn.char_maps(corpus)
Removed 34 chars
The next thing we need is an embedding of the characters.
An embedding is a representation of each token from the sequence as a tensor.
For a char-level RNN, our tokens are chars, so we can use the simplest possible embedding: encode each char as a one-hot tensor. In other words, each char will be represented
as a tensor whose length is the total number of unique chars (V), containing all zeros except at the index
corresponding to that specific char.
TODO: Implement the functions chars_to_onehot() and onehot_to_chars() in the hw3/charnn.py module.
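One way to sketch the two functions (the int8 dtype matches the test below; argmax is used to invert the one-hot encoding):

```python
import torch

def chars_to_onehot(text: str, char_to_idx: dict) -> torch.Tensor:
    # Result shape: (len(text), V); int8 keeps memory usage low.
    idx = torch.tensor([char_to_idx[c] for c in text], dtype=torch.long)
    onehot = torch.zeros(len(text), len(char_to_idx), dtype=torch.int8)
    onehot[torch.arange(len(text)), idx] = 1
    return onehot

def onehot_to_chars(embedding: torch.Tensor, idx_to_char: dict) -> str:
    # argmax recovers the index of the single 1 in each row.
    return ''.join(idx_to_char[i.item()] for i in embedding.argmax(dim=-1))
```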
# Wrap the actual embedding functions for calling convenience
def embed(text):
return charnn.chars_to_onehot(text, char_to_idx)
def unembed(embedding):
return charnn.onehot_to_chars(embedding, idx_to_char)
text_snippet = corpus[3104:3148]
print(text_snippet)
print(embed(text_snippet[0:3]))
test.assertEqual(text_snippet, unembed(embed(text_snippet)))
test.assertEqual(embed(text_snippet).dtype, torch.int8)
brine a maiden can season her praise in.
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0]], dtype=torch.int8)
Dataset Creation¶
We wish to train our model to generate text by constantly predicting what the next char should be based on the past. To that end we'll need to train our recurrent network in a way similar to a classification task. At each timestep, we input a char and set the expected output (label) to be the next char in the original sequence.
We will split our corpus into shorter sequences of length S chars (see question below).
Each sample we provide our model with will therefore be a tensor of shape (S,V) where V is the embedding dimension. Our model will operate sequentially on each char in the sequence.
For each sample, we'll also need a label. This is simply another sequence, shifted by one char so that the label of each char is the next char in the corpus.
TODO: Implement the chars_to_labelled_samples() function in the hw3/charnn.py module.
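The whole function can be sketched along these lines (an assumption-laden sketch: labels are the indices of the next chars, and any leftover chars at the end of the corpus are dropped):

```python
import torch

def chars_to_labelled_samples(corpus: str, char_to_idx: dict, seq_len: int, device='cpu'):
    V = len(char_to_idx)
    idx = torch.tensor([char_to_idx[c] for c in corpus], dtype=torch.long, device=device)
    onehot = torch.zeros(len(corpus), V, dtype=torch.int8, device=device)
    onehot[torch.arange(len(corpus)), idx] = 1
    # Drop the tail so the corpus splits evenly into sequences of length seq_len.
    num_samples = (len(corpus) - 1) // seq_len
    samples = onehot[:num_samples * seq_len].view(num_samples, seq_len, V)
    # Labels are the same sequences shifted forward by one char, as indices.
    labels = idx[1:num_samples * seq_len + 1].view(num_samples, seq_len)
    return samples, labels
```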
# Create dataset of sequences
seq_len = 64
vocab_len = len(char_to_idx)
# Create labelled samples
samples, labels = charnn.chars_to_labelled_samples(corpus, char_to_idx, seq_len, device)
print(f'samples shape: {samples.shape}')
print(f'labels shape: {labels.shape}')
# Test shapes
num_samples = (len(corpus) - 1) // seq_len
test.assertEqual(samples.shape, (num_samples, seq_len, vocab_len))
test.assertEqual(labels.shape, (num_samples, seq_len))
# Test content
for _ in range(1000):
# random sample
i = np.random.randint(num_samples, size=(1,))[0]
# Compare to corpus
test.assertEqual(unembed(samples[i]), corpus[i*seq_len:(i+1)*seq_len], msg=f"content mismatch in sample {i}")
# Compare to labels
sample_text = unembed(samples[i])
label_text = str.join('', [idx_to_char[j.item()] for j in labels[i]])
test.assertEqual(sample_text[1:], label_text[0:-1], msg=f"label mismatch in sample {i}")
samples shape: torch.Size([99182, 64, 78]) labels shape: torch.Size([99182, 64])
Let's print a few consecutive samples. You should see that the text continues between them.
import re
import random
i = random.randrange(num_samples-5)
for i in range(i, i+5):
test.assertEqual(len(samples[i]), seq_len)
s = re.sub(r'\s+', ' ', unembed(samples[i])).strip()
print(f'sample [{i}]:\n\t{s}')
sample [52157]: the eastern gate, all fiery red, Opening on Neptune with fai sample [52158]: r blessed beams, Turns into yellow gold his salt green strea sample [52159]: ms. But, notwithstanding, haste, make no delay; We may e sample [52160]: ffect this business yet ere day. Exit OBERON PUCK. sample [52161]: Up and down, up and down, I will lead them up and
As usual, instead of feeding one sample at a time into our model's forward we'll work with batches of samples. This means that at every timestep, our model will operate on a batch of chars that come from different sequences. Effectively this will allow us to parallelize training by doing matrix-matrix multiplications instead of matrix-vector multiplications during the forward pass.
An important nuance is that we need the batches to be contiguous, i.e. sample $k$ in batch $j$ should continue sample $k$ from batch $j-1$. The following figure illustrates this:

If we naïvely take consecutive samples into batches, e.g. [0,1,...,B-1], [B,B+1,...,2B-1] and so on, we won't have contiguous
sequences at the same index between adjacent batches.
To accomplish this we need to tell our DataLoader which samples to combine together into one batch.
We do this by implementing a custom PyTorch Sampler, and providing it to our DataLoader.
TODO: Implement the SequenceBatchSampler class in the hw3/charnn.py module.
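A sketch that produces the contiguous ordering described above (sample k of batch j+1 continues sample k of batch j, matching the test cell below):

```python
import torch.utils.data

class SequenceBatchSampler(torch.utils.data.Sampler):
    def __init__(self, dataset, batch_size: int):
        self.dataset = dataset
        self.batch_size = batch_size

    def __iter__(self):
        num_batches = len(self.dataset) // self.batch_size
        # Each of the batch_size "streams" walks through its own contiguous
        # chunk of num_batches samples, so index k of batch j+1 continues
        # index k of batch j.
        for j in range(num_batches):
            for k in range(self.batch_size):
                yield k * num_batches + j

    def __len__(self):
        return (len(self.dataset) // self.batch_size) * self.batch_size
```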
from hw3.charnn import SequenceBatchSampler
sampler = SequenceBatchSampler(dataset=range(32), batch_size=10)
sampler_idx = list(sampler)
print('sampler_idx =\n', sampler_idx)
# Test the Sampler
test.assertEqual(len(sampler_idx), 30)
batch_idx = np.array(sampler_idx).reshape(-1, 10)
for k in range(10):
test.assertEqual(np.diff(batch_idx[:, k], n=2).item(), 0)
sampler_idx = [0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 1, 4, 7, 10, 13, 16, 19, 22, 25, 28, 2, 5, 8, 11, 14, 17, 20, 23, 26, 29]
Even though we're working with sequences, we can still use the standard PyTorch Dataset/DataLoader combo.
For the dataset we can use a built-in class, TensorDataset to return tuples of (sample, label)
from the samples and labels tensors we created above.
The DataLoader will be provided with our custom Sampler so that it generates appropriate batches.
import torch.utils.data
# Create DataLoader returning batches of samples.
batch_size = 32
ds_corpus = torch.utils.data.TensorDataset(samples, labels)
sampler_corpus = SequenceBatchSampler(ds_corpus, batch_size)
dl_corpus = torch.utils.data.DataLoader(ds_corpus, batch_size=batch_size, sampler=sampler_corpus, shuffle=False)
Let's see what that gives us:
print(f'num batches: {len(dl_corpus)}')
x0, y0 = next(iter(dl_corpus))
print(f'shape of a batch of samples: {x0.shape}')
print(f'shape of a batch of labels: {y0.shape}')
num batches: 3100
shape of a batch of samples: torch.Size([32, 64, 78]) shape of a batch of labels: torch.Size([32, 64])
Now let's look at the same sample index from multiple batches taken from our corpus.
# Check that sequences at the same index of different batches continue each other.
k = random.randrange(batch_size)
for j, (X, y) in enumerate(dl_corpus,):
print(f'=== batch {j}, sample {k} ({X[k].shape}): ===')
s = re.sub(r'\s+', ' ', unembed(X[k])).strip()
print(f'\t{s}')
if j==4: break
=== batch 0, sample 12 (torch.Size([64, 78])): === e well. I know not, gentlemen, what you intend, Who else === batch 1, sample 12 (torch.Size([64, 78])): === must be let blood, who else is rank. If I myself, there is === batch 2, sample 12 (torch.Size([64, 78])): === no hour so fit As Caesar's death's hour, nor no instrument === batch 3, sample 12 (torch.Size([64, 78])): === Of half that worth as those your swords, made rich With t === batch 4, sample 12 (torch.Size([64, 78])): === he most noble blood of all this world. I do beseech ye, if y
Model Implementation¶
Finally, our dataset is ready, so we can focus on our model.
We'll implement a multilayer gated recurrent unit (GRU) model with dropout. This model is a type of RNN which performs similarly to the well-known LSTM model, but is somewhat easier to train because it has fewer parameters. We'll modify the regular GRU slightly by applying dropout to the hidden states passed between layers of the model.
The model accepts an input $\mat{X}\in\set{R}^{S\times V}$ containing a sequence of embedded chars. It returns an output $\mat{Y}\in\set{R}^{S\times V}$ of predictions for the next char and the final hidden state $\mat{H}\in\set{R}^{L\times H}$. Here $S$ is the sequence length, $V$ is the vocabulary size (number of unique chars), $L$ is the number of layers in the model and $H$ is the hidden dimension.
Mathematically, the model's forward function at layer $k\in[1,L]$ and timestep $t\in[1,S]$ can be described as
$$ \begin{align} \vec{z_t}^{[k]} &= \sigma\left(\vec{x}^{[k]}_t {\mattr{W}_{\mathrm{xz}}}^{[k]} + \vec{h}_{t-1}^{[k]} {\mattr{W}_{\mathrm{hz}}}^{[k]} + \vec{b}_{\mathrm{z}}^{[k]}\right) \\ \vec{r_t}^{[k]} &= \sigma\left(\vec{x}^{[k]}_t {\mattr{W}_{\mathrm{xr}}}^{[k]} + \vec{h}_{t-1}^{[k]} {\mattr{W}_{\mathrm{hr}}}^{[k]} + \vec{b}_{\mathrm{r}}^{[k]}\right) \\ \vec{g_t}^{[k]} &= \tanh\left(\vec{x}^{[k]}_t {\mattr{W}_{\mathrm{xg}}}^{[k]} + (\vec{r_t}^{[k]}\odot\vec{h}_{t-1}^{[k]}) {\mattr{W}_{\mathrm{hg}}}^{[k]} + \vec{b}_{\mathrm{g}}^{[k]}\right) \\ \vec{h_t}^{[k]} &= \vec{z}^{[k]}_t \odot \vec{h}^{[k]}_{t-1} + \left(1-\vec{z}^{[k]}_t\right)\odot \vec{g_t}^{[k]} \end{align} $$
The input to each layer is, $$ \mat{X}^{[k]} = \begin{bmatrix} {\vec{x}_1}^{[k]} \\ \vdots \\ {\vec{x}_S}^{[k]} \end{bmatrix} = \begin{cases} \mat{X} & \mathrm{if} ~k = 1~ \\ \mathrm{dropout}_p \left( \begin{bmatrix} {\vec{h}_1}^{[k-1]} \\ \vdots \\ {\vec{h}_S}^{[k-1]} \end{bmatrix} \right) & \mathrm{if} ~1 < k \leq L+1~ \end{cases}. $$
The output of the entire model is then, $$ \mat{Y} = \mat{X}^{[L+1]} {\mattr{W}_{\mathrm{hy}}} + \mat{B}_{\mathrm{y}} $$
and the final hidden state is $$ \mat{H} = \begin{bmatrix} {\vec{h}_S}^{[1]} \\ \vdots \\ {\vec{h}_S}^{[L]} \end{bmatrix}. $$
Notes:
- $t\in[1,S]$ is the timestep, i.e. the current position within the sequence of each sample.
- $\vec{x}_t^{[k]}$ is the input of layer $k$ at timestep $t$.
- The outputs of the last layer, $\vec{y}_t^{[L]}$, are the scores for the predicted next char at every input position. These are similar to class scores in classification tasks.
- The hidden states at the last timestep, $\vec{h}_S^{[k]}$, are the final hidden state returned from the model.
- $\sigma(\cdot)$ is the sigmoid function, i.e. $\sigma(\vec{z}) = 1/(1+e^{-\vec{z}})$ which returns values in $(0,1)$.
- $\tanh(\cdot)$ is the hyperbolic tangent, i.e. $\tanh(\vec{z}) = (e^{2\vec{z}}-1)/(e^{2\vec{z}}+1)$ which returns values in $(-1,1)$.
- $\vec{h_t}^{[k]}$ is the hidden state of layer $k$ at time $t$. This can be thought of as the memory of that layer.
- $\vec{g_t}^{[k]}$ is the candidate next hidden state for time $t$.
- $\vec{z_t}^{[k]}$ is known as the update gate. It combines the previous state with the input to determine how much the current state will be combined with the new candidate state. For example, if $\vec{z_t}^{[k]}=\vec{1}$ then the current input has no effect on the output.
- $\vec{r_t}^{[k]}$ is known as the reset gate. It combines the previous state with the input to determine how much of the previous state will affect the current state candidate. For example if $\vec{r_t}^{[k]}=\vec{0}$ the previous state has no effect on the current candidate state.
Here's a graphical representation of the GRU's forward pass at each timestep. The $\vec{\tilde{h}}$ in the image is our $\vec{g}$ (candidate next state).

You can see how the reset and update gates allow the model to completely ignore its previous state, completely ignore its input, or use any mixture of those (since the gates are continuous, with values in $(0,1)$).
Here's a graphical representation of the entire model. You can ignore the $c_t^{[k]}$ (cell state) variables (which are relevant for LSTM models). Our model has only the hidden state, $h_t^{[k]}$. Also notice that we added dropout between layers (i.e., on the up arrows).

The purple tensors are inputs (a sequence and initial hidden state per layer), and the green tensors are outputs (another sequence and final hidden state per layer). Each blue block implements the above forward equations. Blocks that are on the same vertical level are at the same layer, and therefore share parameters.
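As a sanity aid, a single GRU timestep for one layer can be written directly from the equations above. This is a hypothetical sketch: p maps gate names to nn.Linear modules, where the wx* layers carry the bias terms and the wh* layers are bias-free, mirroring the model printout below.

```python
import torch
import torch.nn as nn

def gru_step(x, h_prev, p):
    z = torch.sigmoid(p['wxz'](x) + p['whz'](h_prev))   # update gate
    r = torch.sigmoid(p['wxr'](x) + p['whr'](h_prev))   # reset gate
    g = torch.tanh(p['wxg'](x) + p['whg'](r * h_prev))  # candidate state
    return z * h_prev + (1 - z) * g                     # new hidden state

# Hypothetical per-layer parameters for input dim V=5, hidden dim H=8:
V, H = 5, 8
p = {name: nn.Linear(V if name.startswith('wx') else H, H,
                     bias=name.startswith('wx'))
     for name in ['wxz', 'whz', 'wxr', 'whr', 'wxg', 'whg']}
h = gru_step(torch.randn(3, V), torch.zeros(3, H), p)   # batch of 3
print(h.shape)  # torch.Size([3, 8])
```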
TODO: Implement the MultilayerGRU class in the hw3/charnn.py module.
Notes:
- We use batches now. The math is identical to the above, but all the tensors will have an extra batch dimension as their first dimension.
- Study the diagram above and make sure you understand all the dimensions.
in_dim = vocab_len
h_dim = 256
n_layers = 3
model = charnn.MultilayerGRU(in_dim, h_dim, out_dim=in_dim, n_layers=n_layers)
model = model.to(device)
print(model)
# Test forward pass
y, h = model(x0.to(dtype=torch.float, device=device))
print(f'y.shape={y.shape}')
print(f'h.shape={h.shape}')
test.assertEqual(y.shape, (batch_size, seq_len, vocab_len))
test.assertEqual(h.shape, (batch_size, n_layers, h_dim))
test.assertEqual(len(list(model.parameters())), 9 * n_layers + 2)
MultilayerGRU(
(dropout): Dropout(p=0, inplace=False)
(layer_params): ModuleList(
(0): ModuleDict(
(wxz): Linear(in_features=78, out_features=256, bias=True)
(whz): Linear(in_features=256, out_features=256, bias=False)
(wxr): Linear(in_features=78, out_features=256, bias=True)
(whr): Linear(in_features=256, out_features=256, bias=False)
(wxg): Linear(in_features=78, out_features=256, bias=True)
(whg): Linear(in_features=256, out_features=256, bias=False)
)
(1-2): 2 x ModuleDict(
(wxz): Linear(in_features=256, out_features=256, bias=True)
(whz): Linear(in_features=256, out_features=256, bias=False)
(wxr): Linear(in_features=256, out_features=256, bias=True)
(whr): Linear(in_features=256, out_features=256, bias=False)
(wxg): Linear(in_features=256, out_features=256, bias=True)
(whg): Linear(in_features=256, out_features=256, bias=False)
)
)
(output_layer): Linear(in_features=256, out_features=78, bias=True)
)
y.shape=torch.Size([32, 64, 78]) h.shape=torch.Size([32, 3, 256])
Generating text by sampling¶
Now that we have a model, we can implement text generation based on it. The idea is simple: At each timestep our model receives one char $x_t$ from the input sequence and outputs scores $y_t$ for what the next char should be. We'll convert these scores into a probability over each of the possible chars. In other words, for each input char $x_t$ we create a probability distribution for the next char conditioned on the current one and the state of the model (representing all previous inputs): $$p(x_{t+1}|x_t, \vec{h}_t).$$
Once we have such a distribution, we'll sample a char from it. This will be the first char of our generated sequence. Now we can feed this new char into the model, create another distribution, sample the next char and so on. Note that it's crucial to propagate the hidden state when sampling.
The important point, however, is how to create the distribution from the scores. One way, as we saw in previous ML tasks, is to use the softmax function. However, a drawback of softmax is that it can generate very diffuse (more uniform) distributions when the score values are similar. When sampling, we would prefer to control the distribution and make it less uniform, to increase the chance of sampling the char(s) with the highest scores.
To control the variance of the distribution, a common trick is to add a hyperparameter $T$, known as the temperature, to the softmax function. The class scores are simply divided by $T$ before the softmax is applied: $$ \mathrm{softmax}_T(\vec{y}) = \frac{e^{\vec{y}/T}}{\sum_k e^{y_k/T}} $$
A low $T$ will result in less uniform distributions and vice-versa.
TODO: Implement the hot_softmax() function in the hw3/charnn.py module.
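Since the formula is just a softmax over scaled scores, the function can be sketched in one line (the signature here is guessed from how it's called in the test cell below):

```python
import torch

def hot_softmax(y: torch.Tensor, dim=-1, temperature=1.0) -> torch.Tensor:
    # Dividing the scores by T < 1 sharpens the distribution; T > 1
    # flattens it towards uniform.
    return torch.softmax(y / temperature, dim=dim)
```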
scores = y[0,0,:].detach()
_, ax = plt.subplots(figsize=(15,5))
for t in reversed([0.3, 0.5, 1.0, 100]):
ax.plot(charnn.hot_softmax(scores, temperature=t).cpu().numpy(), label=f'T={t}')
ax.set_xlabel('$x_{t+1}$')
ax.set_ylabel('$p(x_{t+1}|x_t)$')
ax.legend()
uniform_proba = 1/len(char_to_idx)
uniform_diff = torch.abs(charnn.hot_softmax(scores, temperature=100) - uniform_proba)
test.assertTrue(torch.all(uniform_diff < 1e-4))
TODO: Implement the generate_from_model() function in the hw3/charnn.py module.
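The sampling loop could be sketched as follows. This is a hypothetical sketch: it assumes the model's forward accepts an optional hidden state and returns (y, h), and that n_chars is the total output length including the start sequence, as in the test below.

```python
import torch

def generate_from_model(model, start_sequence, n_chars, char_maps, T=0.5):
    char_to_idx, idx_to_char = char_maps
    device = next(model.parameters()).device

    def embed(s):  # one-hot embed a string into shape (1, len(s), V)
        x = torch.zeros(1, len(s), len(char_to_idx), device=device)
        for t, c in enumerate(s):
            x[0, t, char_to_idx[c]] = 1
        return x

    out_text, h = start_sequence, None
    x = embed(start_sequence)
    with torch.no_grad():
        while len(out_text) < n_chars:
            y, h = model(x, h)  # crucial: propagate the hidden state
            # Distribution over the next char from the last timestep's scores.
            p = torch.softmax(y[0, -1, :] / T, dim=0)
            next_char = idx_to_char[torch.multinomial(p, num_samples=1).item()]
            out_text += next_char
            x = embed(next_char)  # feed back only the newly sampled char
    return out_text
```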
for _ in range(3):
text = charnn.generate_from_model(model, "foobar", 50, (char_to_idx, idx_to_char), T=0.5)
print(text)
test.assertEqual(len(text), 50)
foobarI7L&lcT1bt987UTL):m;vo3Xu1SxblEJV5K]f.(M!PWN foobarLO?ujSdhP9Xw6LR7qJCcu rV)xGhvl:7ehRxnUw??&Oo
foobarXQxd&sbI0gptQAckGH8xYUZcZ;?i,EdCRJrdJIES"I1e
Training¶
To train this model, we'll calculate the loss at each time step by comparing the predicted char to
the actual char from our label. We can use cross entropy since per char it's similar to a classification problem.
We'll then sum the losses over the sequence and back-propagate the gradients through time.
Notice that the back-propagation algorithm will "visit" each layer's parameter tensors multiple times,
so gradients will accumulate in the parameters of these blocks. Luckily, autograd handles this for us.
As usual, the first step of training will be to try and overfit a large model (many parameters) to a tiny dataset. Again, this is to ensure the model and training code are implemented correctly, i.e. that the model can learn.
For a generative model such as this, overfitting is slightly trickier than for classification. What we'll aim to do is to get our model to memorize a specific sequence of chars, so that when given the first char in the sequence it will immediately spit out the rest of the sequence verbatim.
Let's create a tiny dataset to memorize.
# Pick a tiny subset of the dataset
subset_start, subset_end = 1001, 1005
ds_corpus_ss = torch.utils.data.Subset(ds_corpus, range(subset_start, subset_end))
batch_size_ss = 1
sampler_ss = SequenceBatchSampler(ds_corpus_ss, batch_size=batch_size_ss)
dl_corpus_ss = torch.utils.data.DataLoader(ds_corpus_ss, batch_size_ss, sampler=sampler_ss, shuffle=False)
# Convert subset to text
subset_text = ''
for i in range(subset_end - subset_start):
subset_text += unembed(ds_corpus_ss[i][0])
print(f'Text to "memorize":\n\n{subset_text}')
Text to "memorize":
TRAM. What would you have?
HELENA. Something; and scarce so much; nothing, indeed.
I would not tell you what I would, my lord.
Faith, yes:
Strangers and foes do sunder and not kiss.
BERTRAM. I pray you, stay not, but in haste to horse.
HE
Now let's implement the first part of our training code.
TODO: Implement the train_epoch() and train_batch() methods of the RNNTrainer class in the hw3/training.py module.
You must think about how to correctly handle the hidden state of the model between batches and epochs for this specific task (i.e. text generation).
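Without giving away the full trainer, the key idea can be sketched with a hypothetical helper (assuming the model's forward accepts and returns a hidden state): keep the hidden state between batches, since our sampler makes sequences continue across batches, but detach it from the graph so back-propagation is truncated at batch boundaries.

```python
import torch

def train_batch_sketch(model, loss_fn, optimizer, x, y, hidden_state):
    optimizer.zero_grad()
    y_pred, h = model(x, hidden_state)
    # CrossEntropyLoss expects (N, C) scores and (N,) labels, so we flatten
    # the batch and time dimensions together.
    loss = loss_fn(y_pred.reshape(-1, y_pred.shape[-1]), y.reshape(-1))
    loss.backward()
    optimizer.step()
    # Detach: keep the state's value for the next batch, but drop its
    # computation graph (truncated back-propagation through time).
    return loss.item(), h.detach()
```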
import torch.nn as nn
import torch.optim as optim
from hw3.training import RNNTrainer
torch.manual_seed(42)
lr = 0.01
num_epochs = 500
in_dim = vocab_len
h_dim = 128
n_layers = 2
loss_fn = nn.CrossEntropyLoss()
model = charnn.MultilayerGRU(in_dim, h_dim, out_dim=in_dim, n_layers=n_layers).to(device)
optimizer = optim.Adam(model.parameters(), lr=lr)
trainer = RNNTrainer(model, loss_fn, optimizer, device)
for epoch in range(num_epochs):
epoch_result = trainer.train_epoch(dl_corpus_ss, verbose=False)
# Every X epochs, we'll generate a sequence starting from the first char in the first sequence
# to visualize how/if/what the model is learning.
if epoch == 0 or (epoch+1) % 25 == 0:
avg_loss = np.mean(epoch_result.losses)
accuracy = np.mean(epoch_result.accuracy)
print(f'\nEpoch #{epoch+1}: Avg. loss = {avg_loss:.3f}, Accuracy = {accuracy:.2f}%')
generated_sequence = charnn.generate_from_model(model, subset_text[0],
seq_len*(subset_end-subset_start),
(char_to_idx,idx_to_char), T=0.1)
# Stop if we've successfully memorized the small dataset.
print(generated_sequence)
if generated_sequence == subset_text:
break
# Test successful overfitting
test.assertGreater(epoch_result.accuracy, 99)
test.assertEqual(generated_sequence, subset_text)
Epoch #1: Avg. loss = 3.937, Accuracy = 17.97%
Ttn t t t n t t t t t t tt t t t t t t t
Epoch #25: Avg. loss = 1.034, Accuracy = 73.05%
TAAM. What would you hat youhde?
HELLLAAA. Sotethang; and soat you teanget; and soat nothinge indee and you teinge indee and you teinge inde, note so yoat not sou teinge indee and you teinge soeld not tel, notete.
notete.
notete.
notete.
Epoch #50: Avg. loss = 0.084, Accuracy = 100.00%
TRAM. What would you have?
HELENA. Something; and scarce so much; nothing, indeed.
I would not tell you what I would, my lord.
Faith, yes:
Strangers and foes do sunder and not kiss.
BERTRAM. I pray you, stay not, but in haste to horse.
HE
OK, so training works - we can memorize a short sequence. We'll now train a much larger model on our large dataset. You'll need a GPU for this part.
First, let's set up our dataset and models for training. We'll split our corpus into a 90% train set and a 10% test set. Also, we'll use a learning-rate scheduler to control the learning rate during training.
TODO: Set the hyperparameters in the part1_rnn_hyperparams() function of the hw3/answers.py module.
from hw3.answers import part1_rnn_hyperparams
hp = part1_rnn_hyperparams()
print('hyperparams:\n', hp)
### Dataset definition
vocab_len = len(char_to_idx)
batch_size = hp['batch_size']
seq_len = hp['seq_len']
train_test_ratio = 0.9
num_samples = (len(corpus) - 1) // seq_len
num_train = int(train_test_ratio * num_samples)
samples, labels = charnn.chars_to_labelled_samples(corpus, char_to_idx, seq_len, device)
ds_train = torch.utils.data.TensorDataset(samples[:num_train], labels[:num_train])
sampler_train = SequenceBatchSampler(ds_train, batch_size)
dl_train = torch.utils.data.DataLoader(ds_train, batch_size, shuffle=False, sampler=sampler_train, drop_last=True)
ds_test = torch.utils.data.TensorDataset(samples[num_train:], labels[num_train:])
sampler_test = SequenceBatchSampler(ds_test, batch_size)
dl_test = torch.utils.data.DataLoader(ds_test, batch_size, shuffle=False, sampler=sampler_test, drop_last=True)
print(f'Train: {len(dl_train):3d} batches, {len(dl_train)*batch_size*seq_len:7d} chars')
print(f'Test: {len(dl_test):3d} batches, {len(dl_test)*batch_size*seq_len:7d} chars')
### Training definition
in_dim = out_dim = vocab_len
checkpoint_file = 'checkpoints/rnn'
num_epochs = 50
early_stopping = 5
model = charnn.MultilayerGRU(in_dim, hp['h_dim'], out_dim, hp['n_layers'], hp['dropout'])
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=hp['learn_rate'])
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
optimizer, mode='max', factor=hp['lr_sched_factor'], patience=hp['lr_sched_patience'], verbose=True
)
trainer = RNNTrainer(model, loss_fn, optimizer, device)
hyperparams:
{'batch_size': 64, 'seq_len': 100, 'h_dim': 512, 'n_layers': 3, 'dropout': 0.3, 'learn_rate': 0.001, 'lr_sched_factor': 0.1, 'lr_sched_patience': 3}
Train: 892 batches, 5708800 chars Test: 99 batches, 633600 chars
The code blocks below will train the model and save checkpoints containing the training state and the best model parameters to a file. This allows you to stop training and resume it later from where you left.
Note that you can use the main.py script provided within the assignment folder to run this notebook from the command line as if it were a python script by using the run-nb subcommand. This allows you to train your model using this notebook without starting jupyter. You can combine this with srun or sbatch to run the notebook with a GPU on the course servers.
TODO:
- Implement the fit() method of the Trainer class. You can reuse the relevant implementation parts from HW2, but make sure to implement early stopping and checkpoints.
- Implement the test_epoch() and test_batch() methods of the RNNTrainer class in the hw3/training.py module.
- Run the following block to train.
- When training is done and you're satisfied with the model's outputs, rename the checkpoint file to checkpoints/rnn_final.pt. This will cause the block to skip training and instead load your saved model when running the homework submission script. Note that your submission zip file will not include the checkpoint file. This is OK.
from cs236781.plot import plot_fit

def post_epoch_fn(epoch, train_res, test_res, verbose):
    # Update learning rate
    scheduler.step(test_res.accuracy)
    # Sample from model to show progress
    if verbose:
        start_seq = "ACT I."
        generated_sequence = charnn.generate_from_model(
            model, start_seq, 100, (char_to_idx, idx_to_char), T=0.5
        )
        print(generated_sequence)

# Train, unless final checkpoint is found
checkpoint_file_final = f'{checkpoint_file}_final.pt'
if os.path.isfile(checkpoint_file_final):
    print(f'*** Loading final checkpoint file {checkpoint_file_final} instead of training')
    saved_state = torch.load(checkpoint_file_final, map_location=device)
    model.load_state_dict(saved_state['model_state'])
else:
    try:
        # Print pre-training sampling
        print(charnn.generate_from_model(model, "ACT I.", 100, (char_to_idx, idx_to_char), T=0.5))
        fit_res = trainer.fit(dl_train, dl_test, num_epochs, max_batches=None,
                              post_epoch_fn=post_epoch_fn, early_stopping=early_stopping,
                              checkpoints=checkpoint_file, print_every=1)
        fig, axes = plot_fit(fit_res)
    except KeyboardInterrupt as e:
        print('\n *** Training interrupted by user')
*** Loading final checkpoint file checkpoints/rnn_final.pt instead of training
Generating a work of art¶
Armed with our fully trained model, let's generate the next Hamlet! You should experiment with modifying the sampling temperature and see what happens.
The text you generate should "look" like a Shakespeare play: old-style English words and sentence structure, directions for the actors (like "Exit/Enter"), sections (Act I/Scene III), etc. There will be no coherent plot of course, but it should at least seem like a Shakespearean play when not looking too closely. If this is not what you see, go back, debug and/or re-train.
TODO: Specify the generation parameters in the part1_generation_params() function within the hw3/answers.py module.
# Peek at the start of the corpus (useful for choosing a start sequence):
print(corpus[0:30])
1603 ALLS WELL THAT ENDS WELL
from hw3.answers import part1_generation_params
start_seq, temperature = part1_generation_params()
generated_sequence = charnn.generate_from_model(
model, start_seq, 10000, (char_to_idx,idx_to_char), T=temperature
)
print(generated_sequence)
ACT I. SCENE 1.
PARILLON.
The seas and sorrows have been business
That we have seen them speak and do their beauty.
The man is so in the miseries of the world
The better than the state of love will strike
The sun of fair and virtue. There is no more
Than they will harm the seas and strength and fear.
The self-same spirit, and the good world that should
Be sure to stand by this the grave of heaven,
And be a creature of the world to do
The strength of many men of sorrow straight
That will not stand as this in the rest of them.
The first that hath a point of these shall be,
And therefore there the season shall be sent
To th' state of manners, and the profferer shame
That we may stand and wear the streets that we
That we have strongly straight against the streets.
The market-place is dead, and then the state
Is false to the sea-side of Caesar's wife.
I will not be the subjects of the streets,
And the contentious part of this discourse
That we have seen them sent to see them do.
The field is false, and therefore there the stroke
That would be seen to see the thing I do.
The common princely gods that stand at home,
The service of the people are all the world,
And still at first they are a proper man.
There is no more than that, the greatest store
Which I shall see the wars of the world's love.
The more I have a soldier that he came
To the desire of all the streets of heaven
And send him to the sun that hath a child
That ever shall be set upon his face.
And so the shame will seem to see the world.
The sea was here that hath a strange device
That hath been strangely to the greatest son.
The gods be stain'd to th' common eye of marks,
The spirit of a fair desire to see
The sun and passage of the state of life,
And they will go to th' mouth of the stronger traitor.
The worst is this a stranger to the state of Cherss.
Exeunt.
SCENE II.
A common the Greatest servant.
Enter Troyan and Cornelius.
What say'st thou, sir?
CASSIO. I will not be the most perfection of your
master.
OTHELLO. Exit
POSTHUMUS. I am a maid no more to see your house.
The law of the world is the stronger part
That shall be so assur'd that they did stand.
The world is call'd the court, and they are so ended.
This is the last of him.
LUCIUS. The greatest service that will be the side
That I have seen the man that hath deserves me.
I am a subject of this storm of war,
And therefore the content of such a sight
Is all the sum of the morning.
POLIXENES. I have seen them bear them.
The captain of the world is there a strain
That cannot buy my father's son and heart.
If that the water stand upon my soul,
I cannot see the world to make thee speak.
Thou shalt not speak to me where I have done,
That will I see thy threatres to thy shame.
Thou art a great remembrance of my soul,
And thou art thought to be a princely seat.
Thou art a man of men. There's for thy state
That thou shalt have a state of merits there.
I shall be thought on thee. Thou art a court
And make thee strive to be a dear accuser.
I do not stay the world in him that works
The stars of the desire of thee. The people will not be
As thou art here as shadow on the sea,
And with the state of thine shall be a soldier.
Therefore, good night, good night. Exeunt
ACT II. SCENE 1.
The palace
Enter the BASTARD, and SOLDIERS
ARMADO. Thou shalt not see him as thou didst bear thee.
I will be so assur'd that I have done.
I am not well to be a shame to do.
I thank thee, gentle Lord of York, and Grace
To the senate of the field of England's wife.
The several wars are sure of them that we have stol'n
The state of person, which they do not lose.
The season shall be seen the court of France.
The Duke of Lancaster will be the house.
The first of the distress of the wild way
Is that the Duke of Buckingham and him
And the dead courage of the court of England
Is not the rest of the devil to hear the sea.
The prince of England shall be said to hear
The shadow of the state of heaven and hell
And bring the devil of the country of him.
The King is come to play a park of this.
The sea was made to be a strange offence,
And then the King hath been the breath of war,
That he hath been the business of his face.
The King hath heard him speak a poor and fair.
And then I saw him still and fear to do't.
And when I was a soldier to the dead,
I will not be a king of his own seat.
If he be so, I shall not see your son.
I will not love you to be seen and see
The fairest of the world.
KING. I will be gone to see the King.
BEROWNE. The King is not so fast, and there is such a prince
That he shall see the man that hath been close.
The Duke of Norfolk stands in his desires,
And then he stands a pair of father's son,
The father of his son and such a side
The strange of them as well as they do here.
The blood of this the world will be a strange
That he hath strangely strike the state of heaven.
I have a dear lord of the Duke of York,
And for the day of heaven and his deserts
Will stand and bear the like and worth the field.
The King hath heard the constant course of griefs
That with the state of such a state hath spoke
That they will strike him in the streets of him.
The King is come to see him send to me.
I am a prince's father, and the King
Shall be the stars of this that we may say.
The sweetest son of Buckingham is come
To see the first contempt of all the world.
The faithful son of England is a prince,
And strike him of his blood and scarce a friend.
The devil was the worst in her best company.
The sense of this distraction is the sea,
And there are strong and strong and strong and string
That we have seen them bear the stronger state.
The truth is come to see him that she should not
Be content to be converted with the King.
The King is not so far as I can do.
The Duke of Buckingham and the King is so far from him.
Therefore be gone and well become a fairer.
But what a worse that hath deserved thee,
That thou shalt be a man of soldiers,
And straight and want the sun to be a death,
And the dear man shall be the strength of this.
The trumpets sound and die with thy desires.
This is the man that stands upon the field.
This is the father of the Duke of Gloucester,
And he is but a parley of the state.
But when the sun were not to be a wife,
And so I say, the man is not a flood,
And therefore the wild heart will be a form.
The souls of more than they shall see the greatest
That we have seen the special state of war.
The sun shall be the stars of his departure.
The like the dead of war will be the world.
And what is this that thou hast spoke of me?
The more that I have seen thee as the field,
And the within the sea was seen and dreams.
Thou shalt not stay the crown of England's son.
The King is not a fair proceeding of
The strength of mine own part of mine own words.
And therefore then thou hast not seen thee straight.
I think thou art a sorry that I should,
And I will win thee from thy soul to me.
I will not stay the course of thine to death.
The time is so reported to be said.
The first in all the rest that I have seen,
And therefore shall I speak the stronger than
That will be seen to start and win the state,
And then thou shalt be so against the time
That shall be satisfied. Therefore thy master shall
Be stranger to a man.
Rom. I will not speak of this.
I will not see the season of the world.
If thou didst bear the common fortune of thy breath,
I will not be the world a word of thine.
The more the soul of man that stands alone,
And thou hast seen the sea and the man that dies.
The like and all the state of them are so.
The father doth revenge it to the state,
Which thou art forc'd to see thee all together,
Thou shalt not have thy heart with servants too.
The King is dead, and the proud of the world
I shall be so absent as thou art dead.
The courtesy that thou shalt have a strange
That thou shalt see me with a fair and man.
The country man and thy desire and soul
Is deadly by the search of all the world
That thou art proud to see the common time.
Thou shalt be craven at the day of death,
And the remembrance of thy soul thou lovest.
The truth is so, and therefore shall I stay
The sea of this another deed of mine.
The season shall be paid the common people
Of the substance of the world and day.
The seas of thine is come to supportance.
The sun is but a man of such a face,
And with the street was the wind of the state
Of the wind spirit of the contrary day,
And the revolt of this deserving strength
That stands upon the stage. The King his sword
Shall be the seas of mine own spirits and laws
Of the dead men of heaven, and the which they
Shall be the seas of all the streets of France.
The party that the world will see the fierce.
And therefore, for thy life is this thy father,
And I will stand the mighty son of heaven.
I have not standed there to be a mortal,
And therefore thou art straight and take thy life.
The seas of this the first of thy free place
I
Questions¶
TODO: Answer the following questions. Write your answers in the appropriate variables in the module hw3/answers.py.
from cs236781.answers import display_answer
import hw3.answers
Question 1¶
Why do we split the corpus into sequences instead of training on the whole text?
display_answer(hw3.answers.part1_q1)
We split the corpus into sequences because training on the whole text at once would require keeping all of the hidden states along the way (to use later in backpropagation through time), which costs a great deal of memory for a long text, as studied in class. By splitting the corpus into chunks, we can run the forward and backward passes over one chunk at a time, and thus only need to keep the hidden state at the end of each chunk. Splitting into chunks also lets us speed up training through parallelism (batching).
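The chunking described above can be sketched in a few lines. Note that to_sequences is an illustrative helper, not the hw3 charnn API; the labels are simply the inputs shifted by one character.

```python
def to_sequences(text, seq_len):
    """Split text into fixed-length chunks; labels are inputs shifted by one."""
    n = (len(text) - 1) // seq_len  # -1 leaves room for the shifted labels
    inputs = [text[i * seq_len: i * seq_len + seq_len] for i in range(n)]
    labels = [text[i * seq_len + 1: i * seq_len + seq_len + 1] for i in range(n)]
    return inputs, labels

x, y = to_sequences("ACT I. SCENE 1.", 5)
print(x)  # → ['ACT I', '. SCE']
print(y)  # → ['CT I.', ' SCEN']
```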
Question 2¶
How is it possible that the generated text clearly shows memory longer than the sequence length?
display_answer(hw3.answers.part1_q2)
The model's effective memory is longer than the sequence length because the hidden state is carried over between consecutive sequences, so it serves as a form of long-term memory that is maintained throughout the entire training text. In addition, the GRU architecture is explicitly designed to retain important information over long time spans and to forget irrelevant information.
Question 3¶
Why are we not shuffling the order of batches when training?
display_answer(hw3.answers.part1_q3)
We don't shuffle the order of batches during training because doing so would discard the information carried by each sequence's position in the text. We don't want to treat each sequence as an independent sample; we want to preserve their order so the model can learn the sequential dependencies between them. (The importance of the order between batches is reflected in the hidden state, which retains information about previous sequences and influences the prediction of subsequent ones.)
Question 4¶
- Why do we lower the temperature for sampling (compared to the default of $1.0$)?
- What happens when the temperature is very high and why?
- What happens when the temperature is very low and why?
display_answer(hw3.answers.part1_q4)
- When sampling, we lower the temperature to make the distributions less uniform, which increases the chance of sampling the character(s) with the highest scores compared to the others.
- When the temperature is very high, the softmax that converts scores into probabilities returns a distribution very close to uniform, so the model's predictions resemble picking a random character from the vocabulary each time, as if we hadn't learned anything from the training data.
- When the temperature is very low, the softmax exaggerates the differences between the scores, i.e. it sharpens the probability distribution, making the model highly confident about the most likely characters while practically ignoring the lower-scoring ones. Fewer distinct characters are likely to be sampled, but each of those is sampled with higher probability.
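The three regimes described above can be demonstrated numerically by dividing hypothetical per-character scores by a temperature before the softmax. The helper softmax_T and the score values are illustrative only, not part of the charnn code.

```python
import math

def softmax_T(scores, T):
    """Softmax over scores/T; T controls how peaked the distribution is."""
    exps = [math.exp(s / T) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]  # hypothetical scores for three characters
for T in (5.0, 1.0, 0.1):
    print(T, [round(p, 3) for p in softmax_T(scores, T)])
# High T -> close to uniform; low T -> nearly all mass on the top character.
```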
$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bm}[1]{{\bf #1}} \newcommand{\bb}[1]{\bm{\mathrm{#1}}} $$
Part 2: Variational Autoencoder¶
In this part we will learn to generate new data using a special type of autoencoder model which allows us to sample from its latent space. We'll implement and train a VAE and use it to generate new images.
import unittest
import os
import sys
import pathlib
import urllib
import shutil
import re
import zipfile
import numpy as np
import torch
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2
test = unittest.TestCase()
plt.rcParams.update({'font.size': 12})
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
Using device: cuda
Obtaining the dataset¶
Let's begin by downloading a dataset of images that we want to learn to generate. We'll use the Labeled Faces in the Wild (LFW) dataset which contains many labeled faces of famous individuals.
We're going to train our generative model to generate a specific face, not just any face. Since the person with the most images in this dataset is former president George W. Bush, we'll set out to train a Bush Generator :)
However, if you feel adventurous and/or prefer to generate something else, feel free to edit the PART2_CUSTOM_DATA_URL variable in hw3/answers.py.
import cs236781.plot as plot
import cs236781.download
from hw3.answers import PART2_CUSTOM_DATA_URL as CUSTOM_DATA_URL
DATA_DIR = pathlib.Path.home().joinpath('.pytorch-datasets')
if CUSTOM_DATA_URL is None:
DATA_URL = 'http://vis-www.cs.umass.edu/lfw/lfw-bush.zip'
else:
DATA_URL = CUSTOM_DATA_URL
_, dataset_dir = cs236781.download.download_data(out_path=DATA_DIR, url=DATA_URL, extract=True, force=False)
File /home/avia.avraham/.pytorch-datasets/George_W_Bush2.zip exists, skipping download. Extracting /home/avia.avraham/.pytorch-datasets/George_W_Bush2.zip...
Extracted 532 to /home/avia.avraham/.pytorch-datasets/George_W_Bush/George_W_Bush
Create a Dataset object that will load the extracted images:
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
im_size = 64
tf = T.Compose([
# Resize to constant spatial dimensions
T.Resize((im_size, im_size)),
# PIL.Image -> torch.Tensor
T.ToTensor(),
# Dynamic range [0,1] -> [-1, 1]
T.Normalize(mean=(.5,.5,.5), std=(.5,.5,.5)),
])
ds_gwb = ImageFolder(os.path.dirname(dataset_dir), tf)
OK, let's see what we got. You can run the following block multiple times to display a random subset of images from the dataset.
_ = plot.dataset_first_n(ds_gwb, 50, figsize=(15,10), nrows=5)
print(f'Found {len(ds_gwb)} images in dataset folder.')
Found 530 images in dataset folder.
x0, y0 = ds_gwb[0]
x0 = x0.unsqueeze(0).to(device)
print(x0.shape)
test.assertSequenceEqual(x0.shape, (1, 3, im_size, im_size))
torch.Size([1, 3, 64, 64])
The Variational Autoencoder¶
An autoencoder is a model which learns a representation of data in an unsupervised fashion (i.e. without any labels). Recall its general form from the lecture:

An autoencoder maps an instance $\bb{x}$ to a latent-space representation $\bb{z}$. It has an encoder part, $\Phi_{\bb{\alpha}}(\bb{x})$ (a model with parameters $\bb{\alpha}$) and a decoder part, $\Psi_{\bb{\beta}}(\bb{z})$ (a model with parameters $\bb{\beta}$).
While autoencoders can learn useful representations, generally it's hard to use them as generative models because there's no distribution we can sample from in the latent space. In other words, we have no way to choose a point $\bb{z}$ in the latent space such that $\Psi(\bb{z})$ will end up on the data manifold in the instance space.

The variational autoencoder (VAE), first proposed by Kingma and Welling, addresses this issue by taking a probabilistic perspective. Briefly, a VAE model can be described as follows.
We define, in Bayesian terminology,
- The prior distribution $p(\bb{Z})$ on points in the latent space.
- The posterior distribution of points in the latent spaces given a specific instance: $p(\bb{Z}|\bb{X})$.
- The likelihood distribution of a sample $\bb{X}$ given a latent-space representation: $p(\bb{X}|\bb{Z})$.
- The evidence distribution $p(\bb{X})$ which is the distribution of the instance space due to the generative process.
To create our variational decoder we'll further specify:
- A parametric likelihood distribution, $p _{\bb{\beta}}(\bb{X} | \bb{Z}=\bb{z}) = \mathcal{N}( \Psi _{\bb{\beta}}(\bb{z}) , \sigma^2 \bb{I} )$. The interpretation is that given a latent $\bb{z}$, we map it to a point normally distributed around the point calculated by our decoder neural network. Note that here $\sigma^2$ is a hyperparameter while $\vec{\beta}$ represents the network parameters.
- A fixed latent-space prior distribution of $p(\bb{Z}) = \mathcal{N}(\bb{0},\bb{I})$.
This setting allows us to generate a new instance $\bb{x}$ by sampling $\bb{z}$ from the multivariate normal distribution, obtaining the instance-space mean $\Psi _{\bb{\beta}}(\bb{z})$ using our decoder network, and then sampling $\bb{x}$ from $\mathcal{N}( \Psi _{\bb{\beta}}(\bb{z}) , \sigma^2 \bb{I} )$.
Our variational encoder will approximate the posterior with a parametric distribution $q _{\bb{\alpha}}(\bb{Z} | \bb{x}) = \mathcal{N}( \bb{\mu} _{\bb{\alpha}}(\bb{x}), \mathrm{diag}\{ \bb{\sigma}^2_{\bb{\alpha}}(\bb{x}) \} )$. The interpretation is that our encoder model, $\Phi_{\vec{\alpha}}(\bb{x})$, calculates the mean and variance of the posterior distribution, and samples $\bb{z}$ based on them. An important nuance here is that our network can't contain any stochastic elements that depend on the model parameters, otherwise we won't be able to back-propagate to those parameters. So sampling $\bb{z}$ from $\mathcal{N}( \bb{\mu} _{\bb{\alpha}}(\bb{x}), \mathrm{diag}\{ \bb{\sigma}^2_{\bb{\alpha}}(\bb{x}) \} )$ is not an option. The solution is to use what's known as the reparametrization trick: sample from an isotropic Gaussian, i.e. $\bb{u}\sim\mathcal{N}(\bb{0},\bb{I})$ (which doesn't depend on trainable parameters), and calculate the latent representation as $\bb{z} = \bb{\mu} _{\bb{\alpha}}(\bb{x}) + \bb{u}\odot\bb{\sigma}_{\bb{\alpha}}(\bb{x})$.
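The reparametrization trick is easy to sanity-check numerically. The following is a plain-Python sketch; the actual encoder does this with torch tensors so that gradients flow through $\bb{\mu}$ and $\bb{\sigma}$, and the function name is ours.

```python
import random

def reparametrize(mu, sigma, rng=random):
    # z = mu + u * sigma, with u ~ N(0, I) drawn independently of the
    # trainable parameters, so gradients can flow through mu and sigma.
    return [m + rng.gauss(0.0, 1.0) * s for m, s in zip(mu, sigma)]

random.seed(0)
mu, sigma = [-0.1, 0.0], [0.8, 0.9]
samples = [reparametrize(mu, sigma) for _ in range(20000)]
mean = [sum(s[d] for s in samples) / len(samples) for d in range(2)]
var = [sum((s[d] - mean[d]) ** 2 for s in samples) / len(samples) for d in range(2)]
# The sample mean approaches mu and the sample variance approaches sigma**2:
print([round(m, 2) for m in mean], [round(v, 2) for v in var])
```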
To train a VAE model, we maximize the evidence distribution, $p(\bb{X})$ (see question below). The VAE loss can therefore be stated as minimizing $\mathcal{L} = -\mathbb{E}_{\bb{x}} \log p(\bb{X})$. Although this expectation is intractable, we can obtain a lower-bound for $p(\bb{X})$ (the evidence lower bound, "ELBO", shown in the lecture):
$$ \log p(\bb{X}) \ge \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}} }\left[ \log p _{\bb{\beta}}(\bb{X} | \bb{z}) \right] - \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z} | \bb{X})\,\left\|\, p(\bb{Z} )\right.\right) $$
where $ \mathcal{D} _{\mathrm{KL}}(q\left\|\right.p) = \mathbb{E}_{\bb{z}\sim q}\left[ \log \frac{q(\bb{Z})}{p(\bb{Z})} \right] $ is the Kullback-Leibler divergence, which can be interpreted as the information gained by using the posterior $q(\bb{Z|X})$ instead of the prior distribution $p(\bb{Z})$.
Using the ELBO, the VAE loss becomes, $$ \mathcal{L}(\vec{\alpha},\vec{\beta}) = \mathbb{E} _{\bb{x}} \left[ \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}} }\left[ -\log p _{\bb{\beta}}(\bb{x} | \bb{z}) \right] + \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z} | \bb{x})\,\left\|\, p(\bb{Z} )\right.\right) \right]. $$
By remembering that the likelihood is a Gaussian distribution with a diagonal covariance and by applying the reparametrization trick, we can write the above as
$$ \mathcal{L}(\vec{\alpha},\vec{\beta}) = \mathbb{E} _{\bb{x}} \left[ \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}} } \left[ \frac{1}{2\sigma^2}\left\| \bb{x}- \Psi _{\bb{\beta}}\left( \bb{\mu} _{\bb{\alpha}}(\bb{x}) + \bb{\Sigma}^{\frac{1}{2}} _{\bb{\alpha}}(\bb{x}) \bb{u} \right) \right\| _2^2 \right] + \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z} | \bb{x})\,\left\|\, p(\bb{Z} )\right.\right) \right]. $$
Model Implementation¶
Obviously our model will have two parts, an encoder and a decoder. Since we're working with images, we'll implement both as deep convolutional networks, where the decoder is a "mirror image" of the encoder implemented with adjoint (AKA transposed) convolutions. Between the encoder CNN and the decoder CNN we'll implement the sampling from the parametric posterior approximator $q_{\bb{\alpha}}(\bb{Z}|\bb{x})$ to make it a VAE model and not just a regular autoencoder (of course, this is not yet enough to create a VAE, since we also need a special loss function which we'll get to later).
First let's implement just the CNN part of the Encoder network (this is not the full $\Phi_{\vec{\alpha}}(\bb{x})$ yet). As usual, it should take an input image and map it to an activation volume of a specified depth. We'll consider this volume as the features we extract from the input image. Later we'll use these to create the latent space representation of the input.
import hw3.autoencoder as autoencoder
in_channels = 3
out_channels = 1024
encoder_cnn = autoencoder.EncoderCNN(in_channels, out_channels).to(device)
print(encoder_cnn)
h = encoder_cnn(x0)
print(h.shape)
test.assertEqual(h.dim(), 4)
test.assertSequenceEqual(h.shape[0:2], (1, out_channels))
EncoderCNN(
(cnn): Sequential(
(0): Conv2d(3, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Conv2d(256, 192, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(4): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU()
(6): Conv2d(192, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU()
(9): Conv2d(128, 96, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(10): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): ReLU()
(12): Conv2d(96, 1024, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(13): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(14): ReLU()
)
)
torch.Size([1, 1024, 8, 8])
Now let's implement the CNN part of the Decoder.
Again this is not yet the full $\Psi _{\bb{\beta}}(\bb{z})$. It should take an activation volume produced
by your EncoderCNN and output an image of the same dimensions as the Encoder's input was.
This can be a CNN which is like a "mirror image" of the Encoder. For example, replace convolutions with transposed convolutions, downsampling with up-sampling, etc.
Consult the documentation of ConvTranspose2D
to figure out how to reverse your convolutional layers in terms of input and output dimensions. Note that the decoder doesn't have to be exactly the opposite of the encoder and you can experiment with using a different architecture.
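To see how the dimensions work out, the output-size formulas for Conv2d and ConvTranspose2d (with dilation 1, as given in the PyTorch documentation) can be checked directly on the 5x5/stride-2 layers used in this notebook. The helper names below are ours; note the output_padding=1 needed to exactly undo a stride-2 downsampling of an even input size.

```python
def conv_out(n, k, s, p):
    """Spatial size after Conv2d (dilation=1): floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

def convT_out(n, k, s, p, op=0):
    """Spatial size after ConvTranspose2d (dilation=1): (n-1)s - 2p + k + op."""
    return (n - 1) * s - 2 * p + k + op

n = 64
down = conv_out(n, k=5, s=2, p=2)          # encoder layer: 64 -> 32
up = convT_out(down, k=5, s=2, p=2, op=1)  # mirrored decoder layer: 32 -> 64
print(down, up)  # → 32 64
```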
TODO: Implement the DecoderCNN class in the hw3/autoencoder.py module.
decoder_cnn = autoencoder.DecoderCNN(in_channels=out_channels, out_channels=in_channels).to(device)
print(decoder_cnn)
x0r = decoder_cnn(h)
print(x0r.shape)
test.assertEqual(x0.shape, x0r.shape)
# Should look like colored noise
T.functional.to_pil_image(x0r[0].cpu().detach())
DecoderCNN(
(cnn): Sequential(
(0): ConvTranspose2d(1024, 96, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): ConvTranspose2d(96, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU()
(6): ConvTranspose2d(128, 192, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1), bias=False)
(7): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU()
(9): ConvTranspose2d(192, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1), bias=False)
(10): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): ReLU()
(12): ConvTranspose2d(256, 3, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1), bias=False)
(13): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
torch.Size([1, 3, 64, 64])
Let's now implement the full VAE Encoder, $\Phi_{\vec{\alpha}}(\vec{x})$. It will work as follows:
- Produce a feature vector $\vec{h}$ from the input image $\vec{x}$.
- Use two affine transforms to convert the features into the mean and log-variance of the posterior, i.e. $$ \begin{align} \bb{\mu} _{\bb{\alpha}}(\bb{x}) &= \vec{h}\mattr{W}_{\mathrm{h\mu}} + \vec{b}_{\mathrm{h\mu}} \\ \log\left(\bb{\sigma}^2_{\bb{\alpha}}(\bb{x})\right) &= \vec{h}\mattr{W}_{\mathrm{h\sigma^2}} + \vec{b}_{\mathrm{h\sigma^2}} \end{align} $$
- Use the reparametrization trick to create the latent representation $\vec{z}$.
Notice that we model the log of the variance, not the actual variance. The above formulation is proposed in appendix C of the VAE paper.
TODO: Implement the encode() method in the VAE class within the hw3/autoencoder.py module.
You'll also need to define your parameters in __init__().
z_dim = 2
vae = autoencoder.VAE(encoder_cnn, decoder_cnn, x0[0].size(), z_dim).to(device)
print(vae)
z, mu, log_sigma2 = vae.encode(x0)
test.assertSequenceEqual(z.shape, (1, z_dim))
test.assertTrue(z.shape == mu.shape == log_sigma2.shape)
print(f'mu(x0)={list(*mu.detach().cpu().numpy())}, sigma2(x0)={list(*torch.exp(log_sigma2).detach().cpu().numpy())}')
VAE(
(features_encoder): EncoderCNN(
(cnn): Sequential(
(0): Conv2d(3, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Conv2d(256, 192, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(4): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU()
(6): Conv2d(192, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU()
(9): Conv2d(128, 96, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(10): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): ReLU()
(12): Conv2d(96, 1024, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(13): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(14): ReLU()
)
)
(features_decoder): DecoderCNN(
(cnn): Sequential(
(0): ConvTranspose2d(1024, 96, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): ConvTranspose2d(96, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU()
(6): ConvTranspose2d(128, 192, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1), bias=False)
(7): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU()
(9): ConvTranspose2d(192, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1), bias=False)
(10): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): ReLU()
(12): ConvTranspose2d(256, 3, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1), bias=False)
(13): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(z): Linear(in_features=2, out_features=65536, bias=True)
(mu): Linear(in_features=65536, out_features=2, bias=True)
(log_sigma2): Linear(in_features=65536, out_features=2, bias=True)
)
mu(x0)=[-0.118786275, 0.038355697], sigma2(x0)=[0.7098376, 0.8528108]
Let's sample some 2d latent representations for an input image x0 and visualize them.
# Sample from q(Z|x)
N = 500
Z = torch.zeros(N, z_dim)
_, ax = plt.subplots()
with torch.no_grad():
for i in range(N):
Z[i], _, _ = vae.encode(x0)
ax.scatter(*Z[i].cpu().numpy())
# Should be close to the mu/sigma in the previous block above
print('sampled mu', torch.mean(Z, dim=0))
print('sampled sigma2', torch.var(Z, dim=0))
sampled mu tensor([-0.1277, 0.0046]) sampled sigma2 tensor([0.7080, 0.8574])
Let's now implement the full VAE Decoder, $\Psi _{\bb{\beta}}(\bb{z})$. It will work as follows:
- Produce a feature vector $\tilde{\vec{h}}$ from the latent vector $\vec{z}$ using an affine transform.
- Reconstruct an image $\tilde{\vec{x}}$ from $\tilde{\vec{h}}$ using the decoder CNN.
TODO: Implement the decode() method in the VAE class within the hw3/autoencoder.py module.
You'll also need to define your parameters in __init__(). You may need to also re-run the block above after you implement this.
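For orientation, here is a minimal sketch of what decode() might look like. The attribute names (z_to_h, features_decoder) and the feature shape are hypothetical, and the CNN is replaced by an identity placeholder; your actual implementation in hw3/autoencoder.py will differ.

```python
import math
import torch
import torch.nn as nn

class VAEDecodeSketch(nn.Module):
    """Hypothetical stand-in for the VAE decode path; attribute names and
    the feature shape are illustrative, not the required ones."""
    def __init__(self, z_dim=2, features_shape=(1024, 8, 8)):
        super().__init__()
        self.features_shape = features_shape
        # Affine transform: latent vector z -> flattened feature vector h~
        self.z_to_h = nn.Linear(z_dim, math.prod(features_shape))
        # Placeholder for the decoder CNN (in the real model, the
        # ConvTranspose2d stack printed above)
        self.features_decoder = nn.Identity()

    def decode(self, z):
        h = self.z_to_h(z)                       # (B, C*H*W)
        h = h.reshape(-1, *self.features_shape)  # (B, C, H, W)
        return self.features_decoder(h)          # reconstructed x~

dec = VAEDecodeSketch()
x_rec = dec.decode(torch.randn(4, 2))
print(tuple(x_rec.shape))  # (4, 1024, 8, 8)
```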
x0r = vae.decode(z)
test.assertSequenceEqual(x0r.shape, x0.shape)
Our model's forward() function will simply return decode(encode(x)) as well as the calculated mean and log-variance of the posterior.
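That composition can be sketched with a hypothetical stand-in model (linear layers instead of the CNNs, illustrative attribute names); the point is the wiring of encode, the reparameterization trick, and decode, not the exact architecture:

```python
import torch
import torch.nn as nn

class VAEForwardSketch(nn.Module):
    """Hypothetical minimal VAE wiring; the real encode/decode live in
    hw3/autoencoder.py and use the CNN encoder/decoder."""
    def __init__(self, d=12, z_dim=2):
        super().__init__()
        self.mu = nn.Linear(d, z_dim)
        self.log_sigma2 = nn.Linear(d, z_dim)
        self.dec = nn.Linear(z_dim, d)

    def encode(self, x):
        mu, log_sigma2 = self.mu(x), self.log_sigma2(x)
        # Reparameterization trick: z = mu + sigma * u, with u ~ N(0, I)
        z = mu + torch.exp(0.5 * log_sigma2) * torch.randn_like(mu)
        return z, mu, log_sigma2

    def decode(self, z):
        return self.dec(z)

    def forward(self, x):
        z, mu, log_sigma2 = self.encode(x)
        return self.decode(z), mu, log_sigma2

vae_sketch = VAEForwardSketch()
xr_s, mu_s, ls2_s = vae_sketch(torch.randn(1, 12))
print(tuple(xr_s.shape), tuple(mu_s.shape))  # (1, 12) (1, 2)
```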
x0r, mu, log_sigma2 = vae(x0)
test.assertSequenceEqual(x0r.shape, x0.shape)
test.assertSequenceEqual(mu.shape, (1, z_dim))
test.assertSequenceEqual(log_sigma2.shape, (1, z_dim))
T.functional.to_pil_image(x0r[0].detach().cpu())
Loss Implementation¶
In practice, since we're using SGD, we'll drop the expectation over $\bb{X}$ and instead sample an instance from the training set and compute a point-wise loss. Similarly, we'll drop the expectation over $\bb{Z}$ by sampling from $q_{\vec{\alpha}}(\bb{Z}|\bb{x})$. Additionally, because the KL divergence is between two Gaussian distributions, there is a closed-form expression for it. These points bring us to the following point-wise loss:
$$ \ell(\vec{\alpha},\vec{\beta};\bb{x}) = \frac{1}{\sigma^2 d_x} \left\| \bb{x}- \Psi _{\bb{\beta}}\left( \bb{\mu} _{\bb{\alpha}}(\bb{x}) + \bb{\Sigma}^{\frac{1}{2}} _{\bb{\alpha}}(\bb{x}) \bb{u} \right) \right\| _2^2 + \mathrm{tr}\,\bb{\Sigma} _{\bb{\alpha}}(\bb{x}) + \|\bb{\mu} _{\bb{\alpha}}(\bb{x})\|^2 _2 - d_z - \log\det \bb{\Sigma} _{\bb{\alpha}}(\bb{x}), $$
where $d_z$ is the dimension of the latent space, $d_x$ is the dimension of the input and $\bb{u}\sim\mathcal{N}(\bb{0},\bb{I})$. This pointwise loss is the quantity that we'll compute and minimize with gradient descent. The first term corresponds to the data-reconstruction loss, while the second term corresponds to the KL-divergence loss. Note that the scaling by $d_x$ is not derived from the original loss formula and was added directly to the pointwise loss just to normalize the data term.
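A direct transcription of this pointwise loss could look as follows. This is a sketch only: the batch reduction (mean vs. sum) and any extra normalization are implementation choices you must match to the test below.

```python
import torch

def vae_loss_sketch(x, xr, z_mu, z_log_sigma2, x_sigma2):
    # Data term: 1/(sigma^2 * d_x) * ||x - xr||^2, computed per sample
    d_x = x[0].numel()
    sq_err = ((x - xr) ** 2).reshape(x.shape[0], -1).sum(dim=1)
    data_loss = sq_err / (x_sigma2 * d_x)
    # KL term in closed form for a diagonal Gaussian posterior:
    # tr(Sigma) + ||mu||^2 - d_z - log det(Sigma)
    d_z = z_mu.shape[1]
    kldiv_loss = (z_log_sigma2.exp().sum(dim=1)
                  + (z_mu ** 2).sum(dim=1)
                  - d_z
                  - z_log_sigma2.sum(dim=1))
    # Reduce over the batch (mean here; sum is another convention)
    return ((data_loss + kldiv_loss).mean(),
            data_loss.mean(), kldiv_loss.mean())

loss, data_l, kl_l = vae_loss_sketch(
    torch.randn(10, 3, 64, 64), torch.randn(10, 3, 64, 64),
    torch.randn(10, 32), torch.randn(10, 32), x_sigma2=0.9)
```

Both terms are nonnegative by construction: the data term is a scaled squared error, and the closed-form KL expression satisfies $e^{l} - l - 1 \geq 0$ per latent dimension.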
TODO: Implement the vae_loss() function in the hw3/autoencoder.py module.
from hw3.autoencoder import vae_loss
torch.manual_seed(42)
def test_vae_loss():
# Test data
N, C, H, W = 10, 3, 64, 64
z_dim = 32
x = torch.randn(N, C, H, W)*2 - 1
xr = torch.randn(N, C, H, W)*2 - 1
z_mu = torch.randn(N, z_dim)
z_log_sigma2 = torch.randn(N, z_dim)
x_sigma2 = 0.9
loss, _, _ = vae_loss(x, xr, z_mu, z_log_sigma2, x_sigma2)
test.assertAlmostEqual(loss.item(), 58.3234367, delta=1e-3)
return loss
test_vae_loss()
tensor(58.3234)
Sampling¶
The main advantage of a VAE is that it can be used as a generative model by sampling the latent space, since the loss function optimizes for an isotropic Gaussian prior $p(\bb{Z})$. Let's now implement this so that we can visualize how our model is doing as it trains.
TODO: Implement the sample() method in the VAE class within the hw3/autoencoder.py module.
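Since we optimized for a standard-normal prior, sampling reduces to drawing $\bb{z}\sim\mathcal{N}(\bb{0},\bb{I})$ and decoding it. A minimal sketch, using a hypothetical linear stand-in for the decoder:

```python
import torch

def vae_sample_sketch(decode_fn, z_dim, n):
    # Draw latents from the standard-normal prior p(Z) and decode them;
    # no_grad because sampling here is not part of a training step.
    with torch.no_grad():
        z = torch.randn(n, z_dim)
        return decode_fn(z)

decoder_stub = torch.nn.Linear(2, 12)  # hypothetical decoder stand-in
samples = vae_sample_sketch(decoder_stub, z_dim=2, n=5)
print(samples.shape, samples.grad_fn)  # torch.Size([5, 12]) None
```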
samples = vae.sample(5)
_ = plot.tensors_as_images(samples)
Training¶
Time to train!
TODO:
- Implement the VAETrainer class in the hw3/training.py module. Make sure to implement the checkpoints feature of the Trainer class if you haven't done so already in Part 1.
- Tweak the hyperparameters in the part2_vae_hyperparams() function within the hw3/answers.py module.
import torch.optim as optim
from torch.utils.data import random_split
from torch.utils.data import DataLoader
from torch.nn import DataParallel
from hw3.training import VAETrainer
from hw3.answers import part2_vae_hyperparams
torch.manual_seed(42)
# Hyperparams
hp = part2_vae_hyperparams()
batch_size = hp['batch_size']
h_dim = hp['h_dim']
z_dim = hp['z_dim']
x_sigma2 = hp['x_sigma2']
learn_rate = hp['learn_rate']
betas = hp['betas']
# Data
split_lengths = [int(len(ds_gwb)*0.9), int(len(ds_gwb)*0.1)]
ds_train, ds_test = random_split(ds_gwb, split_lengths)
dl_train = DataLoader(ds_train, batch_size, shuffle=True)
dl_test = DataLoader(ds_test, batch_size, shuffle=True)
im_size = ds_train[0][0].shape
# Model
encoder = autoencoder.EncoderCNN(in_channels=im_size[0], out_channels=h_dim)
decoder = autoencoder.DecoderCNN(in_channels=h_dim, out_channels=im_size[0])
vae = autoencoder.VAE(encoder, decoder, im_size, z_dim)
vae_dp = DataParallel(vae).to(device)
# Optimizer
optimizer = optim.Adam(vae.parameters(), lr=learn_rate, betas=betas)
# Loss
def loss_fn(x, xr, z_mu, z_log_sigma2):
return autoencoder.vae_loss(x, xr, z_mu, z_log_sigma2, x_sigma2)
# Trainer
trainer = VAETrainer(vae_dp, loss_fn, optimizer, device)
checkpoint_file = 'checkpoints/vae'
checkpoint_file_final = f'{checkpoint_file}_final'
if os.path.isfile(f'{checkpoint_file}.pt'):
os.remove(f'{checkpoint_file}.pt')
# Show model and hypers
print(vae)
print(hp)
VAE(
(features_encoder): EncoderCNN(
(cnn): Sequential(
(0): Conv2d(3, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Conv2d(256, 192, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(4): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU()
(6): Conv2d(192, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU()
(9): Conv2d(128, 96, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(10): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): ReLU()
(12): Conv2d(96, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(13): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(14): ReLU()
)
)
(features_decoder): DecoderCNN(
(cnn): Sequential(
(0): ConvTranspose2d(256, 96, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): ConvTranspose2d(96, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU()
(6): ConvTranspose2d(128, 192, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1), bias=False)
(7): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU()
(9): ConvTranspose2d(192, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1), bias=False)
(10): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): ReLU()
(12): ConvTranspose2d(256, 3, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1), bias=False)
(13): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(z): Linear(in_features=128, out_features=16384, bias=True)
(mu): Linear(in_features=16384, out_features=128, bias=True)
(log_sigma2): Linear(in_features=16384, out_features=128, bias=True)
)
{'batch_size': 32, 'h_dim': 256, 'z_dim': 128, 'x_sigma2': 0.0023, 'learn_rate': 0.00019, 'betas': (0.9, 0.999)}
TODO:
- Run the following block to train. It will sample some images from your model every few epochs so you can see the progress.
- When you're satisfied with your results, rename the checkpoints file by adding _final. When you run the main.py script to generate your submission, the final checkpoints file will be loaded instead of running training. Note that your final submission zip will not include the checkpoints/ folder. This is OK.
The images you get should be colorful, with different backgrounds and poses.
import IPython.display
def post_epoch_fn(epoch, train_result, test_result, verbose):
# Plot some samples if this is a verbose epoch
if verbose:
samples = vae.sample(n=5)
fig, _ = plot.tensors_as_images(samples, figsize=(6,2))
IPython.display.display(fig)
plt.close(fig)
if os.path.isfile(f'{checkpoint_file_final}.pt'):
print(f'*** Loading final checkpoint file {checkpoint_file_final} instead of training')
checkpoint_file = checkpoint_file_final
else:
res = trainer.fit(dl_train, dl_test,
num_epochs=200, early_stopping=20, print_every=10,
checkpoints=checkpoint_file,
post_epoch_fn=post_epoch_fn)
# Plot images from best model
saved_state = torch.load(f'{checkpoint_file}.pt', map_location=device)
vae_dp.load_state_dict(saved_state['model_state'])
print('*** Images Generated from best model:')
fig, _ = plot.tensors_as_images(vae_dp.module.sample(n=15), nrows=3, figsize=(6,6))
*** Loading final checkpoint file checkpoints/vae_final instead of training
/tmp/ipykernel_3251005/3902647951.py:21: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
saved_state = torch.load(f'{checkpoint_file}.pt', map_location=device)
*** Images Generated from best model:
Questions¶
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw3/answers.py.
from cs236781.answers import display_answer
import hw3.answers
Question 1¶
What does the $\sigma^2$ hyperparameter (x_sigma2 in the code) do? Explain the effect of low and high values.
display_answer(hw3.answers.part2_q1)
Your answer: Looking at the loss function of the VAE, we can see that the reconstruction loss is the sum of squared differences between the input and the decoder's output, divided by the variance of the input: the $\sigma^2$ hyperparameter. This means that when $\sigma^2$ is low, the VAE is pushed to reconstruct the input more precisely, as it penalizes large differences more heavily. In practice this can also lead to overfitting, since the model tries to memorize the input rather than learn more general latent features. On the other hand, when $\sigma^2$ is high, the VAE is less strict about the reconstruction, which may allow the model to learn more robust, higher-level features, but can also lead to underfitting.
Question 2¶
- Explain the purpose of both parts of the VAE loss term - reconstruction loss and KL divergence loss.
- How is the latent-space distribution affected by the KL loss term?
- What's the benefit of this effect?
display_answer(hw3.answers.part2_q2)
Your answer:
- The reconstruction loss ensures that the model can reconstruct the input data well from the latent representation. The KL divergence loss ensures that the latent-space distribution stays close to the prior distribution.
- A low KL loss term means the latent-space distribution is close to the prior. For example, if the prior is a standard normal distribution, the KL loss pushes the latent space to also be a standard normal distribution.
- When the latent-space distribution is close to the prior, the latent space is smooth and continuous like the prior, making it possible to sample meaningful latent variables during generation. This also prevents overfitting by encoding into a more generalizable latent space, and it allows interpolation and generation of new data points, because the latent space resembles the prior distribution.
Question 3¶
In the formulation of the VAE loss, why do we start by maximizing the evidence distribution, $p(X)$?
display_answer(hw3.answers.part2_q3)
Your answer: We maximize the evidence distribution $p(X)$ because in generative modeling we want our model to assign high probability to the observed data. Intuitively, the better the model “explains” or “fits” the data, the larger $p(X)$ becomes. Hence, finding parameters that maximize $p(X)$ yields a model that best represents or generates the data we observe. (In practice we maximize a lower bound on the log-evidence, the ELBO, since the evidence itself is intractable to compute.)
Question 4¶
In the VAE encoder, why do we model the log of the latent-space variance corresponding to an input, $\sigma^2_{\alpha}$, instead of directly modelling this variance?
display_answer(hw3.answers.part2_q4)
Your answer: We let the encoder output the log of the latent-space variance, i.e. $\log \sigma_{\alpha}^2$, instead of $\sigma_{\alpha}^2$ directly, in order to ensure that the resulting variance (obtained by exponentiation) is always positive, and to improve numerical stability: the unconstrained log can span many orders of magnitude without the extremely large or vanishingly small values that would arise if the network had to output the variance itself.
$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bm}[1]{{\bf #1}} \newcommand{\bb}[1]{\bm{\mathrm{#1}}} $$
Part 2: Generative Adversarial Networks¶
In this part we will implement and train a generative adversarial network and apply it to the task of image generation.
import unittest
import os
import sys
import pathlib
import urllib
import shutil
import re
import zipfile
import numpy as np
import torch
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2
test = unittest.TestCase()
plt.rcParams.update({'font.size': 12})
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
Using device: cuda
Obtaining the dataset¶
Let's begin by downloading a dataset of images that we want to learn to generate. We'll use the Labeled Faces in the Wild (LFW) dataset which contains many labeled faces of famous individuals.
We're going to train our generative model to generate a specific face, not just any face. Since the person with the most images in this dataset is former president George W. Bush, we'll set out to train a Bush Generator :)
However, if you feel adventurous and/or prefer to generate something else, feel free to edit the PART3_CUSTOM_DATA_URL variable in hw3/answers.py.
Using a custom dataset this way can also earn you a bonus!
import cs236781.plot as plot
import cs236781.download
from hw3.answers import PART3_CUSTOM_DATA_URL as CUSTOM_DATA_URL
DATA_DIR = pathlib.Path.home().joinpath('.pytorch-datasets')
if CUSTOM_DATA_URL is None:
DATA_URL = 'http://vis-www.cs.umass.edu/lfw/lfw-bush.zip'
else:
DATA_URL = CUSTOM_DATA_URL
_, dataset_dir = cs236781.download.download_data(out_path=DATA_DIR, url=DATA_URL, extract=True, force=False)
File /home/avia.avraham/.pytorch-datasets/George_W_Bush2.zip exists, skipping download. Extracting /home/avia.avraham/.pytorch-datasets/George_W_Bush2.zip...
Extracted 532 to /home/avia.avraham/.pytorch-datasets/George_W_Bush/George_W_Bush
Create a Dataset object that will load the extracted images:
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
im_size = 64
tf = T.Compose([
# Resize to constant spatial dimensions
T.Resize((im_size, im_size)),
# PIL.Image -> torch.Tensor
T.ToTensor(),
# Dynamic range [0,1] -> [-1, 1]
T.Normalize(mean=(.5,.5,.5), std=(.5,.5,.5)),
])
ds_gwb = ImageFolder(os.path.dirname(dataset_dir), tf)
OK, let's see what we got. You can run the following block multiple times to display a random subset of images from the dataset.
_ = plot.dataset_first_n(ds_gwb, 50, figsize=(15,10), nrows=5)
print(f'Found {len(ds_gwb)} images in dataset folder.')
Found 530 images in dataset folder.
x0, y0 = ds_gwb[0]
x0 = x0.unsqueeze(0).to(device)
print(x0.shape)
test.assertSequenceEqual(x0.shape, (1, 3, im_size, im_size))
torch.Size([1, 3, 64, 64])
Generative Adversarial Nets (GANs)¶
GANs, first proposed in a paper by Ian Goodfellow in 2014, are today arguably the most popular type of generative model. GANs currently produce state-of-the-art results in generative tasks across many different domains.
In a GAN model, two different neural networks compete against each other: A generator and a discriminator.
The Generator, which we'll denote as $\Psi _{\bb{\gamma}} : \mathcal{U} \rightarrow \mathcal{X}$, maps a latent-space variable $\bb{u}\sim\mathcal{N}(\bb{0},\bb{I})$ to an instance-space variable $\bb{x}$ (e.g. an image). Thus a parametric evidence distribution $p_{\bb{\gamma}}(\bb{X})$ is generated, which we typically would like to be as close as possible to the real evidence distribution, $p(\bb{X})$.
The Discriminator, $\Delta _{\bb{\delta}} : \mathcal{X} \rightarrow [0,1]$, is a network which, given an instance-space variable $\bb{x}$, returns the probability that $\bb{x}$ is real, i.e. that $\bb{x}$ was sampled from $p(\bb{X})$ and not $p_{\bb{\gamma}}(\bb{X})$.

Training GANs¶
The generator is trained to generate "fake" instances which will maximally fool the discriminator into returning that they're real. Mathematically, the generator's parameters $\bb{\gamma}$ should be chosen such as to maximize the expression $$ \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$
The discriminator is trained to classify between real images, coming from the training set, and fake images generated by the generator. Mathematically, the discriminator's parameters $\bb{\delta}$ should be chosen such as to maximize the expression $$ \mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, + \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$
These two competing objectives can thus be expressed as the following min-max optimization: $$ \min _{\bb{\gamma}} \max _{\bb{\delta}} \, \mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, + \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$
A key insight into GANs is that we can interpret the above maximum as the loss with respect to $\bb{\gamma}$:
$$ L({\bb{\gamma}}) = \max _{\bb{\delta}} \, \mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, + \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$
This means that the generator's loss function is not fixed: it is trained together with the generator itself in an adversarial manner. In contrast, when training our VAE we used a fixed L2 norm as the data loss term.
Model Implementation¶
We'll now implement a Deep Convolutional GAN (DCGAN) model. See the DCGAN paper for architecture ideas and tips for training.
TODO: Implement the Discriminator class in the hw3/gan.py module.
If you wish you can reuse the EncoderCNN class from the VAE model as the first part of the Discriminator.
import hw3.gan as gan
dsc = gan.Discriminator(in_size=x0[0].shape).to(device)
print(dsc)
d0 = dsc(x0)
print(d0.shape)
test.assertSequenceEqual(d0.shape, (1,1))
Discriminator(
(encoder): Sequential(
(0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(1): LeakyReLU(negative_slope=0.2, inplace=True)
(2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(4): LeakyReLU(negative_slope=0.2, inplace=True)
(5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(6): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): LeakyReLU(negative_slope=0.2, inplace=True)
(8): Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(9): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(10): LeakyReLU(negative_slope=0.2, inplace=True)
)
(fc): Linear(in_features=8192, out_features=1, bias=True)
)
torch.Size([1, 1])
TODO: Implement the Generator class in the hw3/gan.py module.
If you wish you can reuse the DecoderCNN class from the VAE model as the last part of the Generator.
z_dim = 128
gen = gan.Generator(z_dim, 4).to(device)
print(gen)
z = torch.randn(1, z_dim).to(device)
xr = gen(z)
print(xr.shape)
test.assertSequenceEqual(x0.shape, xr.shape)
Generator(
(decoder): Sequential(
(0): ConvTranspose2d(128, 512, kernel_size=(4, 4), stride=(1, 1), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU(inplace=True)
(6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU(inplace=True)
(9): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(10): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): ReLU(inplace=True)
(12): ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(13): Tanh()
)
)
torch.Size([1, 3, 64, 64])
Loss Implementation¶
Let's begin with the discriminator's loss function. Based on the above we can flip the sign and say we want to update the Discriminator's parameters $\bb{\delta}$ so that they minimize the expression $$ - \mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, - \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$
We're using the Discriminator twice in this expression; once to classify data from the real data distribution and once again to classify generated data. Therefore our loss should be computed based on these two terms. Notice that since the discriminator returns a probability, we can formulate the above as two cross-entropy losses.
GANs are notoriously difficult to train. One common trick for improving GAN stability during training is to make the classification labels noisy for the discriminator. This can be seen as a form of regularization that helps prevent the discriminator from overfitting.
We'll incorporate this idea into our loss function. Instead of labels being equal to 0 or 1, we'll make them "fuzzy", i.e. random numbers in the ranges $[0\pm\epsilon]$ and $[1\pm\epsilon]$.
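A sketch of such a loss, under two assumptions you should check against your own code: the discriminator outputs raw scores rather than probabilities (the printed architecture above ends in a plain Linear layer, so the with-logits form of binary cross-entropy applies), and label_noise denotes the full width of the noise interval:

```python
import torch
import torch.nn.functional as F

def discriminator_loss_sketch(y_data, y_generated, data_label=1., label_noise=0.0):
    # Fuzzy targets: uniform noise of width label_noise around each label
    target_data = data_label + (torch.rand_like(y_data) - 0.5) * label_noise
    target_gen = (1. - data_label) + (torch.rand_like(y_generated) - 0.5) * label_noise
    # Two cross-entropy terms: one for real data, one for generated data
    loss_data = F.binary_cross_entropy_with_logits(y_data, target_data)
    loss_generated = F.binary_cross_entropy_with_logits(y_generated, target_gen)
    return loss_data + loss_generated

loss = discriminator_loss_sketch(torch.randn(10), torch.randn(10), label_noise=0.3)
```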
TODO: Implement the discriminator_loss_fn() function in the hw3/gan.py module.
from hw3.gan import discriminator_loss_fn
torch.manual_seed(42)
y_data = torch.rand(10) * 10
y_generated = torch.rand(10) * 10
loss = discriminator_loss_fn(y_data, y_generated, data_label=1, label_noise=0.3)
print(loss)
test.assertAlmostEqual(loss.item(), 6.4808731, delta=1e-5)
tensor(6.4809)
Similarly, the generator's parameters $\bb{\gamma}$ should minimize the expression $$ -\mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )) $$
which can also be seen as a cross-entropy term. This corresponds to "fooling" the discriminator; notice that the gradient of the loss w.r.t. $\bb{\gamma}$ using this expression also depends on $\bb{\delta}$.
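Assuming, as before, that the discriminator outputs raw scores, this term can be sketched as a single with-logits cross-entropy against the data label (no label noise is needed on the generator's side):

```python
import torch
import torch.nn.functional as F

def generator_loss_sketch(y_generated, data_label=1.):
    # The generator wants its samples scored as real, so the target is
    # the data label; no label noise is needed on this side.
    target = torch.full_like(y_generated, float(data_label))
    return F.binary_cross_entropy_with_logits(y_generated, target)

gen_loss = generator_loss_sketch(torch.randn(20))
```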
TODO: Implement the generator_loss_fn() function in the hw3/gan.py module.
from hw3.gan import generator_loss_fn
torch.manual_seed(42)
y_generated = torch.rand(20) * 10
loss = generator_loss_fn(y_generated, data_label=1)
print(loss)
test.assertAlmostEqual(loss.item(), 0.0222969, delta=1e-3)
tensor(0.0223)
Sampling¶
Sampling from a GAN is straightforward, since it learns to generate data from an isotropic Gaussian latent space distribution.
There is an important nuance, however. Sampling is required during the process of training the GAN, since we generate fake images to show the discriminator. As you'll see in the next section, in some cases we'll need our samples to have gradients (i.e., to be part of the Generator's computation graph).
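The with_grad switch can be sketched with autograd context managers; the generator here is a hypothetical linear stand-in:

```python
import torch

def gan_sample_sketch(generator_forward, z_dim, n, with_grad):
    # Keep gradients when the samples will feed the generator's own loss;
    # discard them (saving memory) for the discriminator step or plotting.
    ctx = torch.enable_grad() if with_grad else torch.no_grad()
    with ctx:
        z = torch.randn(n, z_dim)
        samples = generator_forward(z)
    return samples

gen_stub = torch.nn.Linear(2, 12)  # hypothetical generator stand-in
detached = gan_sample_sketch(gen_stub, 2, 5, with_grad=False)
attached = gan_sample_sketch(gen_stub, 2, 5, with_grad=True)
print(detached.grad_fn, attached.grad_fn is not None)  # None True
```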
TODO: Implement the sample() method in the Generator class within the hw3/gan.py module.
samples = gen.sample(5, with_grad=False)
test.assertSequenceEqual(samples.shape, (5, *x0.shape[1:]))
test.assertIsNone(samples.grad_fn)
_ = plot.tensors_as_images(samples.cpu())
samples = gen.sample(5, with_grad=True)
test.assertSequenceEqual(samples.shape, (5, *x0.shape[1:]))
test.assertIsNotNone(samples.grad_fn)
Training¶
Training GANs is a bit different, since we need to train two models simultaneously, each with its own loss function and optimizer. We'll implement the training logic as a function that handles one batch of data and updates both the discriminator and the generator based on it.
As mentioned above, GANs are considered hard to train. To get some ideas and tips you can see this paper, this list of "GAN hacks" or just do it the hard way :)
TODO:
- Implement the train_batch function in the hw3/gan.py module.
- Tweak the hyperparameters in the part3_gan_hyperparams() function within the hw3/answers.py module.
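One common structure for such a batch step — a discriminator update on detached samples, followed by a generator update on samples that keep their gradients — can be sketched as follows. The stand-in models at the bottom are hypothetical, only there to exercise the step once; your Discriminator and Generator live in hw3/gan.py.

```python
import torch
import torch.nn.functional as F

def train_batch_sketch(dsc, gen, dsc_loss_fn, gen_loss_fn,
                       dsc_optimizer, gen_optimizer, x_data):
    # Discriminator step: fake samples are detached from the generator's
    # graph (with_grad=False), so only the discriminator is updated.
    dsc_optimizer.zero_grad()
    y_data = dsc(x_data)
    y_generated = dsc(gen.sample(x_data.shape[0], with_grad=False))
    dsc_loss = dsc_loss_fn(y_data, y_generated)
    dsc_loss.backward()
    dsc_optimizer.step()

    # Generator step: fresh samples that keep gradients, so the loss can
    # reach the generator's parameters through the discriminator.
    gen_optimizer.zero_grad()
    y_generated = dsc(gen.sample(x_data.shape[0], with_grad=True))
    gen_loss = gen_loss_fn(y_generated)
    gen_loss.backward()
    gen_optimizer.step()

    return dsc_loss.item(), gen_loss.item()

# Tiny hypothetical stand-ins, only to exercise the step once:
class GenStub(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(4, 8)
    def sample(self, n, with_grad):
        ctx = torch.enable_grad() if with_grad else torch.no_grad()
        with ctx:
            return self.lin(torch.randn(n, 4))

gen_s, dsc_s = GenStub(), torch.nn.Linear(8, 1)
d_loss, g_loss = train_batch_sketch(
    dsc_s, gen_s,
    lambda yd, yg: (F.binary_cross_entropy_with_logits(yd, torch.ones_like(yd))
                    + F.binary_cross_entropy_with_logits(yg, torch.zeros_like(yg))),
    lambda yg: F.binary_cross_entropy_with_logits(yg, torch.ones_like(yg)),
    torch.optim.SGD(dsc_s.parameters(), lr=0.01),
    torch.optim.SGD(gen_s.parameters(), lr=0.01),
    torch.randn(6, 8))
```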
import torch.optim as optim
from torch.utils.data import DataLoader
from hw3.answers import part3_gan_hyperparams
torch.manual_seed(42)
# Hyperparams
hp = part3_gan_hyperparams()
batch_size = hp['batch_size']
z_dim = hp['z_dim']
# Data
dl_train = DataLoader(ds_gwb, batch_size, shuffle=True)
im_size = ds_gwb[0][0].shape
# Model
dsc = gan.Discriminator(im_size).to(device)
gen = gan.Generator(z_dim, featuremap_size=4).to(device)
# Optimizer
def create_optimizer(model_params, opt_params):
opt_params = opt_params.copy()
optimizer_type = opt_params['type']
opt_params.pop('type')
return optim.__dict__[optimizer_type](model_params, **opt_params)
dsc_optimizer = create_optimizer(dsc.parameters(), hp['discriminator_optimizer'])
gen_optimizer = create_optimizer(gen.parameters(), hp['generator_optimizer'])
# Loss
def dsc_loss_fn(y_data, y_generated):
return gan.discriminator_loss_fn(y_data, y_generated, hp['data_label'], hp['label_noise'])
def gen_loss_fn(y_generated):
return gan.generator_loss_fn(y_generated, hp['data_label'])
# Training
checkpoint_file = 'checkpoints/gan'
checkpoint_file_final = f'{checkpoint_file}_final'
if os.path.isfile(f'{checkpoint_file}.pt'):
os.remove(f'{checkpoint_file}.pt')
# Show hypers
print(hp)
{'batch_size': 64, 'z_dim': 128, 'betas': (0.5, 0.999), 'discriminator_optimizer': {'type': 'Adam', 'lr': 0.0002, 'betas': (0.5, 0.999)}, 'generator_optimizer': {'type': 'Adam', 'lr': 0.0002, 'betas': (0.5, 0.999)}, 'data_label': 1, 'label_noise': 0.15}
TODO:
- Implement the save_checkpoint function in the hw3.gan module. You can decide on your own criterion regarding whether to save a checkpoint at the end of each epoch.
- Run the following block to train. It will sample some images from your model every few epochs so you can see the progress.
- When you're satisfied with your results, rename the checkpoints file by adding _final. When you run the main.py script to generate your submission, the final checkpoints file will be loaded instead of running training. Note that your final submission zip will not include the checkpoints/ folder. This is OK.
import IPython.display
import tqdm
from hw3.gan import train_batch, save_checkpoint
num_epochs = 100
if os.path.isfile(f'{checkpoint_file_final}.pt'):
print(f'*** Loading final checkpoint file {checkpoint_file_final} instead of training')
num_epochs = 0
gen = torch.load(f'{checkpoint_file_final}.pt', map_location=device)
checkpoint_file = checkpoint_file_final
try:
dsc_avg_losses, gen_avg_losses = [], []
for epoch_idx in range(num_epochs):
# We'll accumulate batch losses and show an average once per epoch.
dsc_losses, gen_losses = [], []
print(f'--- EPOCH {epoch_idx+1}/{num_epochs} ---')
with tqdm.tqdm(total=len(dl_train.batch_sampler), file=sys.stdout) as pbar:
for batch_idx, (x_data, _) in enumerate(dl_train):
x_data = x_data.to(device)
dsc_loss, gen_loss = train_batch(
dsc, gen,
dsc_loss_fn, gen_loss_fn,
dsc_optimizer, gen_optimizer,
x_data)
dsc_losses.append(dsc_loss)
gen_losses.append(gen_loss)
pbar.update()
dsc_avg_losses.append(np.mean(dsc_losses))
gen_avg_losses.append(np.mean(gen_losses))
print(f'Discriminator loss: {dsc_avg_losses[-1]}')
print(f'Generator loss: {gen_avg_losses[-1]}')
if save_checkpoint(gen, dsc_avg_losses, gen_avg_losses, checkpoint_file):
print(f'Saved checkpoint.')
samples = gen.sample(5, with_grad=False)
fig, _ = plot.tensors_as_images(samples.cpu(), figsize=(6,2))
IPython.display.display(fig)
plt.close(fig)
except KeyboardInterrupt as e:
print('\n *** Training interrupted by user')
*** Loading final checkpoint file checkpoints/gan_final instead of training
# Plot images from best or last model
if os.path.isfile(f'{checkpoint_file}.pt'):
gen = torch.load(f'{checkpoint_file}.pt', map_location=device)
print('*** Images Generated from best model:')
samples = gen.sample(n=15, with_grad=False).cpu()
fig, _ = plot.tensors_as_images(samples, nrows=3, figsize=(6,6))
*** Images Generated from best model:
Questions¶
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw3/answers.py.
from cs236781.answers import display_answer
import hw3.answers
Question 1¶
Explain in detail why during training we sometimes need to maintain gradients when sampling from the GAN, and other times we don't. When are they maintained and why? When are they discarded and why?
display_answer(hw3.answers.part3_q1)
Question 2¶
When training a GAN to generate images, should we decide to stop training solely based on the fact that the Generator loss is below some threshold? Why or why not?
What does it mean if the discriminator loss remains at a constant value while the generator loss decreases?
display_answer(hw3.answers.part2_q2)
Your answer: The reconstruction loss's purpose is to ensure that the model can reconstruct the input data well from the latent space. The KL divergence loss's purpose is to ensure that the latent space distribution is close to the prior distribution.
When the KL loss term is low, the latent space distribution is close to the prior distribution. So for example, if the prior distribution is a standard normal distribution, the KL loss will push the latent space to also be a standard normal distribution.
- When the latent space distribution is close to the prior distribution, it ensures the latent space is smooth and continuous like the prior distribution, making it possible to sample meaningful latent variables during generation. In addition, it also prevents overfitting by encoding into a more generalizable latent space, and it allows the interpolation and generation of new data points, because the latent space is similar to the prior distribution.
Question 3¶
Compare the results you got when generating images with the VAE to the GAN results. What's the main difference and what's causing it?
display_answer(hw3.answers.part2_q3)
Your answer: We maximize the evidence distribution $p(X)$ because in generative modeling we want our model to assign high probability to the observed data. Intuitively, the better the model “explains” or “fits” the data, the larger $p(X)$ becomes. Hence, finding parameters that maximize $p(X)$ yields a model that best represents or generates the data we observe. (In practice we maximize an approximation of the evidence distribution (the ELBO), since the true evidence distribution is intractable to compute.)
$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} $$
Part 3: Transformer¶
In this part we will implement a variation of the attention mechanism named the 'sliding window attention'. Next, we will create a transformer encoder with the sliding-window attention implementation, and we will train the encoder for sentiment analysis.
%load_ext autoreload
%autoreload 2
import unittest
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
import copy
import torch.optim as optim
from tqdm import tqdm
import os
test = unittest.TestCase()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
Using device: cuda
Reminder: scaled dot product attention¶
In class, you saw that the scaled dot product attention is defined as:
$$ \begin{align} \mat{B} &= \frac{1}{\sqrt{d}} \mat{Q}\mattr{K} \ \in\set{R}^{m\times n} \\ \mat{A} &= softmax({\mat{B}},{\mathrm{dim}=1}), \in\set{R}^{m\times n} \\ \mat{Y} &= \mat{A}\mat{V} \ \in\set{R}^{m\times d_v}. \end{align} $$
where $\mat{K}$, $\mat{Q}$ and $\mat{V}$ for self-attention are obtained as projections of the same input sequence:
$$ \begin{align*} \vec{q}_{i} &= \mat{W}_{xq}\vec{x}_{i} & \vec{k}_{i} &= \mat{W}_{xk}\vec{x}_{i} & \vec{v}_{i} &= \mat{W}_{xv}\vec{x}_{i} \end{align*} $$
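As a quick refresher, the formulas above can be sketched directly in PyTorch. This is a minimal, unbatched illustration with shapes and names of our choosing, not the course implementation:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # B = Q K^T / sqrt(d): pairwise similarity between queries and keys
    d = Q.shape[-1]
    B = Q @ K.transpose(-2, -1) / d ** 0.5
    # softmax over the keys dimension turns each row of scores into weights
    A = F.softmax(B, dim=-1)
    # each output row is a weighted average of the value vectors
    return A @ V, A

# 4 queries and 6 keys/values of dimension 8, with 3-dim values
Q, K, V = torch.randn(4, 8), torch.randn(6, 8), torch.randn(6, 3)
Y, A = scaled_dot_product_attention(Q, K, V)
print(Y.shape)  # torch.Size([4, 3])
```

Note that each row of $\mat{A}$ sums to 1, since the softmax normalizes over the keys.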
If you feel the attention mechanism doesn't quite sit right, we recommend you go over lecture and tutorial notes before proceeding.
We are now going to introduce a slight variation of the scaled dot product attention.
Sliding window attention¶
The scaled dot product attention computes the dot product between every pair of key and query vectors. Therefore, the computational complexity is $O(n^2)$ where $n$ is the sequence length.
In order to obtain a computational complexity that grows linearly with the sequence length, the authors of 'Longformer: The Long-Document Transformer' (https://arxiv.org/pdf/2004.05150.pdf) proposed the 'sliding window attention', a variation of the scaled dot product attention.
In this variation, instead of computing the dot product for every pair of key and query vectors, the dot product is only computed for keys that are in a certain 'window' around the query vector.
For example, if the keys and queries are embeddings of words in the sentence "CS is more prestigious than EE", and the window size is 2, then for the query corresponding to the word 'is' we will only compute a dot product with the keys that are at most $\frac{window\_size}{2} = \frac{2}{2} = 1$ positions to the left and to the right, meaning the keys that correspond to the words 'CS', 'is' and 'more'. Formally, the intermediate calculation of the normalized dot product can be written as:
$$ \mathrm{b}(q, k, w) = \begin{cases} \frac{q \cdot k^\top}{\sqrt{d_k}} & \mathrm{if} \; d(q,k) \le \frac{w}{2} \\ -\infty & \mathrm{otherwise.} \end{cases} $$
Where $b(\cdot,\cdot,\cdot)$ is the intermediate result function (used to construct a matrix $\mat{B}$ on which we perform the softmax), $q$ is the query vector, $k$ is the key vector, $w$ is the sliding window size, and $d(\cdot,\cdot)$ is the distance function between the positions of the tokens corresponding to the key and query vectors.
Note: The distance function $d(\cdot,\cdot)$ is not cyclical. Meaning that in the example above, when searching for the words at distance 1 from the word 'CS', we don't wrap around cyclically from the right and count the word 'EE'.
The result of this operation can be visualized like this: (green corresponds to computing the scaled dot product, and white to a no-op or $-∞$).

TODO: Implement the sliding_window_attention function in hw3/transformer.py
from hw3.transformer import sliding_window_attention
## test sliding-window attention
num_heads = 3
batch_size = 2
seq_len = 5
embed_dim = 3
window_size = 2
## test without extra dimension for heads
x = torch.arange(seq_len*embed_dim).reshape(seq_len,embed_dim).repeat(batch_size,1).reshape(batch_size, seq_len, -1).float()
values, attention = sliding_window_attention(x, x, x,window_size)
gt_values = torch.load(os.path.join('test_tensors','values_tensor_0_heads.pt'))
test.assertTrue(torch.all(values == gt_values), f'the tensors differ in dims [B,row,col]:{torch.stack(torch.where(values != gt_values),dim=0)}')
gt_attention = torch.load(os.path.join('test_tensors','attention_tensor_0_heads.pt'))
test.assertTrue(torch.all(attention == gt_attention), f'the tensors differ in dims [B,row,col]:{torch.stack(torch.where(attention != gt_attention),dim=0)}')
## test with extra dimension for heads
x = torch.arange(seq_len*embed_dim).reshape(seq_len,embed_dim).repeat(batch_size, num_heads, 1).reshape(batch_size, num_heads, seq_len, -1).float()
values, attention = sliding_window_attention(x, x, x,window_size)
gt_values = torch.load(os.path.join('test_tensors','values_tensor_3_heads.pt'))
test.assertTrue(torch.all(values == gt_values), f'the tensors differ in dims [B,num_heads,row,col]:{torch.stack(torch.where(values != gt_values),dim=0)}')
gt_attention = torch.load(os.path.join('test_tensors','attention_tensor_3_heads.pt'))
test.assertTrue(torch.all(attention == gt_attention), f'the tensors differ in dims [B,num_heads,row,col]:{torch.stack(torch.where(attention != gt_attention),dim=0)}')
Multi-head sliding window attention¶
As you've seen in class, the transformer model uses a multi-head attention module. We will use the same implementation you've seen in the tutorial, aside from the attention mechanism itself, which will be swapped with the sliding-window attention you implemented.
TODO: Insert the call to the sliding-window attention mechanism in the forward of MultiHeadAttention in hw3/transformer.py
Sentiment analysis¶
We will now move on to the task of sentiment analysis, which is the process of analyzing text to determine whether the emotional tone of the message is positive or negative (often a neutral class is also used, but this won't be the case in the data we will be working with).
IMDB Hugging Face dataset¶
Hugging Face is a popular open-source library and platform that provides state-of-the-art tools and resources for natural language processing (NLP) tasks. It has gained immense popularity within the NLP community due to its user-friendly interfaces, powerful pre-trained models, and a vibrant community that actively contributes to its development.
Hugging Face provides a wide array of tools and utilities, which we will leverage as well. The Hugging Face Transformers library, built on top of PyTorch and TensorFlow, offers a simple yet powerful API for working with Transformer-based models (such as Distil-BERT). It enables users to easily load, fine-tune, and evaluate models, as well as generate text using these models.
Furthermore, Hugging Face offers the Hugging Face Datasets library, which provides access to a vast collection of publicly available datasets for NLP. These datasets can be conveniently downloaded and used for training and evaluation purposes.
You are encouraged to visit their site and see other uses: https://huggingface.co/
import numpy as np
import pandas as pd
import sys
import pathlib
import urllib
import shutil
import re
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2
The autoreload extension is already loaded. To reload it, use: %reload_ext autoreload
from datasets import DatasetDict
from datasets import load_dataset, concatenate_datasets
First, we load the dataset using Hugging Face's datasets library.
Feel free to look around at the full array of datasets that they offer.
https://huggingface.co/docs/datasets/index
We will load the full training and test sets in addition to a small toy subset of the training set.
dataset = load_dataset('imdb', split=['train', 'test', 'train[12480:12520]'])
print(dataset)
[Dataset({
features: ['text', 'label'],
num_rows: 25000
}), Dataset({
features: ['text', 'label'],
num_rows: 25000
}), Dataset({
features: ['text', 'label'],
num_rows: 40
})]
We see that it returned a list of 3 labeled datasets, the first two of size 25,000, and the third of size 40.
We will use these as train and test datasets for training the model, and the toy dataset for a sanity check.
These Datasets are wrapped in a Dataset class.
We now wrap the dataset into a DatasetDict class, which contains helpful methods to use for working with the data.
https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.DatasetDict
#wrap it in a DatasetDict to enable methods such as map and format
dataset = DatasetDict({'train': dataset[0], 'val': dataset[1], 'toy': dataset[2]})
dataset
DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 25000
})
val: Dataset({
features: ['text', 'label'],
num_rows: 25000
})
toy: Dataset({
features: ['text', 'label'],
num_rows: 40
})
})
We can now access the datasets in the Dict as we would a dictionary. Let's print a few training samples
print(dataset['train'])
for i in range(4):
print(f'TRAINING SAMPLE {i}:')
print(dataset['train'][i]['text'])
label = dataset['train'][i]['label']
print(f'Label {i}: {label}')
print('\n')
Dataset({
features: ['text', 'label'],
num_rows: 25000
})
TRAINING SAMPLE 0:
I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.<br /><br />I do commend the filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film doesn't have much of a plot.
Label 0: 0
TRAINING SAMPLE 1:
"I Am Curious: Yellow" is a risible and pretentious steaming pile. It doesn't matter what one's political views are because this film can hardly be taken seriously on any level. As for the claim that frontal male nudity is an automatic NC-17, that isn't true. I've seen R-rated films with male nudity. Granted, they only offer some fleeting views, but where are the R-rated films with gaping vulvas and flapping labia? Nowhere, because they don't exist. The same goes for those crappy cable shows: schlongs swinging in the breeze but not a clitoris in sight. And those pretentious indie movies like The Brown Bunny, in which we're treated to the site of Vincent Gallo's throbbing johnson, but not a trace of pink visible on Chloe Sevigny. Before crying (or implying) "double-standard" in matters of nudity, the mentally obtuse should take into account one unavoidably obvious anatomical difference between men and women: there are no genitals on display when actresses appears nude, and the same cannot be said for a man. In fact, you generally won't see female genitals in an American film in anything short of porn or explicit erotica. This alleged double-standard is less a double standard than an admittedly depressing ability to come to terms culturally with the insides of women's bodies.
Label 1: 0
TRAINING SAMPLE 2:
If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story.<br /><br />One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives (unless one comes up with one while one's mind wanders, as it will invariably do during this pointless film).<br /><br />One might better spend one's time staring out a window at a tree growing.<br /><br />
Label 2: 0
TRAINING SAMPLE 3:
This film was probably inspired by Godard's Masculin, féminin and I urge you to see that film instead.<br /><br />The film has two strong elements and those are, (1) the realistic acting (2) the impressive, undeservedly good, photo. Apart from that, what strikes me most is the endless stream of silliness. Lena Nyman has to be most annoying actress in the world. She acts so stupid and with all the nudity in this film,...it's unattractive. Comparing to Godard's film, intellectuality has been replaced with stupidity. Without going too far on this subject, I would say that follows from the difference in ideals between the French and the Swedish society.<br /><br />A movie of its time, and place. 2/10.
Label 3: 0
We should check the label distribution:
def label_cnt(type):
ds = dataset[type]
size = len(ds)
cnt= 0
for smp in ds:
cnt += smp['label']
print(f'negative samples in {type} dataset: {size - cnt}')
print(f'positive samples in {type} dataset: {cnt}')
label_cnt('train')
label_cnt('val')
label_cnt('toy')
negative samples in train dataset: 12500
positive samples in train dataset: 12500
negative samples in val dataset: 12500
positive samples in val dataset: 12500
negative samples in toy dataset: 20
positive samples in toy dataset: 20
Import the tokenizer for the dataset¶
Let’s tokenize the texts into individual word tokens using the tokenizer implementation inherited from the pre-trained model class.
With Hugging Face you will always find a tokenizer associated with each model. If you are not doing research or experiments on tokenizers it’s always preferable to use the standard tokenizers.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print("Tokenizer input max length:", tokenizer.model_max_length)
print("Tokenizer vocabulary size:", tokenizer.vocab_size)
Tokenizer input max length: 512
Tokenizer vocabulary size: 30522
Let's create helper functions to tokenize the text. Notice the arguments sent to the tokenizer.
Padding is a strategy for ensuring tensors are rectangular by adding a special padding token to shorter sentences.
On the other hand, sometimes a sequence may be too long for a model to handle. In this case, you'll need to truncate the sequence to a shorter length.
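In plain Python terms, padding and the accompanying attention mask look roughly like this (the token ids below are made up for illustration; the real tokenizer handles all of this internally):

```python
# Two tokenized sentences of different lengths (hypothetical token ids)
batch = [[101, 2023, 102], [101, 2023, 2003, 2307, 102]]

# Pad shorter sequences with 0 (the [PAD] id) so the batch is rectangular
max_len = max(len(ids) for ids in batch)
input_ids = [ids + [0] * (max_len - len(ids)) for ids in batch]

# The attention mask marks real tokens with 1 and padding with 0,
# so the model can ignore the padded positions
attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]

print(input_ids)
print(attention_mask)
```

This is exactly the role of the `input_ids` and `attention_mask` columns you will see in the tokenized dataset below.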
def tokenize_text(batch):
return tokenizer(batch["text"], truncation=True, padding=True)
def tokenize_dataset(dataset):
dataset_tokenized = dataset.map(tokenize_text, batched=True, batch_size =None)
return dataset_tokenized
dataset_tokenized = tokenize_dataset(dataset)
# we would like to work with pytorch so we can manually fine-tune
dataset_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])
# no need to parallelize in this assignment
os.environ["TOKENIZERS_PARALLELISM"] = "false"
Setting up the dataloaders and dataset¶
We will now set up the dataloaders for efficient batching and loading of the data.
By now, you are familiar with the class methods needed to create a working DataLoader.
from torch.utils.data import DataLoader, Dataset
class IMDBDataset(Dataset):
def __init__(self, dataset):
self.ds = dataset
def __getitem__(self, index):
return self.ds[index]
def __len__(self):
return self.ds.num_rows
train_dataset = IMDBDataset(dataset_tokenized['train'])
val_dataset = IMDBDataset(dataset_tokenized['val'])
toy_dataset = IMDBDataset(dataset_tokenized['toy'])
dl_train,dl_val, dl_toy = [
DataLoader(
dataset=train_dataset,
batch_size=12,
shuffle=True,
num_workers=0
),
DataLoader(
dataset=val_dataset,
batch_size=12,
shuffle=True,
num_workers=0
),
DataLoader(
dataset=toy_dataset,
batch_size=4,
num_workers=0
)]
Transformer Encoder¶
The model we will use for the task at hand is the encoder of the transformer proposed in the seminal paper 'Attention Is All You Need'.
The encoder is composed of positional encoding, followed by multiple blocks, each of which computes multi-head attention, layer normalization and a feed-forward network, as described in the diagram below.

We provided you with implementations for the positional encoding and the position-wise feed-forward MLP in hw3/transformer.py.
Feel free to read through the implementations to make sure you understand what they do.
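For reference, a common sinusoidal positional encoding (as in 'Attention Is All You Need') can be sketched as follows; the provided implementation in hw3/transformer.py may differ in its details:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d: int) -> torch.Tensor:
    # pe[p, 2i]   = sin(p / 10000^(2i/d))
    # pe[p, 2i+1] = cos(p / 10000^(2i/d))
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d, 2, dtype=torch.float) * (-math.log(10000.0) / d))
    pe = torch.zeros(max_len, d)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

pe = sinusoidal_positional_encoding(64, 16)
print(pe.shape)  # torch.Size([64, 16])
```

The encoding is added to the token embeddings, giving the otherwise permutation-invariant attention layers information about token positions.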
TODO: To begin with, complete the transformer EncoderLayer in hw3/transformer.py
from hw3.transformer import EncoderLayer
# set torch seed for reproducibility
torch.manual_seed(0)
layer = EncoderLayer(embed_dim=16, hidden_dim=16, num_heads=4, window_size=4, dropout=0.1)
# load x and y
x = torch.load(os.path.join('test_tensors','encoder_layer_input.pt'))
y = torch.load(os.path.join('test_tensors','encoder_layer_output.pt'))
padding_mask = torch.ones(2, 10)
padding_mask[:, 5:] = 0
# forward pass
out = layer(x, padding_mask)
test.assertTrue(torch.allclose(out, y, atol=1e-6), 'output of encoder layer is incorrect')
In order to classify a sentence using the encoder, we need to somehow summarize the output of the last encoder layer (which will include an output for each token in the tokenized input sentence).
There are several options for doing this. We will use the output of the special token [CLS], appended to the beginning of each sentence by the BERT tokenizer we are using.
Let's see an example of the first tokens in a sentence after tokenization:
tokenizer.convert_ids_to_tokens(dataset_tokenized['train'][0]['input_ids'])[:10]
['[CLS]', 'i', 'rented', 'i', 'am', 'curious', '-', 'yellow', 'from', 'my']
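Concretely, classifying from the [CLS] token just means taking the encoder output at position 0 and passing it through a linear head. The shapes and the head below are a hypothetical sketch, not the course's `Encoder` code:

```python
import torch
import torch.nn as nn

# Hypothetical encoder output: batch of 2 sentences, 10 tokens, 16-dim embeddings
enc_out = torch.randn(2, 10, 16)

# [CLS] sits at position 0, so its output vector summarizes the sentence
cls_repr = enc_out[:, 0, :]          # shape (2, 16)

# A single-logit head for binary sentiment, to pair with BCEWithLogitsLoss
head = nn.Linear(16, 1)
logits = head(cls_repr).squeeze(-1)  # shape (2,)
print(logits.shape)
```

A single logit per sentence is all that's needed here, since the dataset has only positive/negative labels.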
TODO: Now it's time to put it all together. Complete the implementation of 'Encoder' in hw3/transformer.py
from hw3.transformer import Encoder
# set torch seed for reproducibility
torch.manual_seed(0)
encoder = Encoder(vocab_size=100, embed_dim=16, num_heads=4, num_layers=3,
hidden_dim=16, max_seq_length=64, window_size=4, dropout=0.1)
# load x and y
x = torch.load(os.path.join('test_tensors','encoder_input.pt'))
y = torch.load(os.path.join('test_tensors','encoder_output.pt'))
# x = torch.randint(0, 100, (2, 64)).long()
padding_mask = torch.ones(2, 64)
padding_mask[:, 50:] = 0
# forward pass
out = encoder(x, padding_mask)
test.assertTrue(torch.allclose(out, y, atol=1e-6), 'output of encoder layer is incorrect')
Training the encoder¶
We will now proceed to train the model.
TODO: Complete the implementation of TransformerEncoderTrainer in hw3/training.py
Training on a toy dataset¶
To begin with, we will train on a small toy dataset of 40 samples. This will serve as a sanity check to make sure nothing is buggy.
TODO: choose the hyperparameters in hw3.answers part4_transformer_encoder_hyperparams.
from hw3.answers import part4_transformer_encoder_hyperparams
params = part4_transformer_encoder_hyperparams()
print(params)
embed_dim = params['embed_dim']
num_heads = params['num_heads']
num_layers = params['num_layers']
hidden_dim = params['hidden_dim']
window_size = params['window_size']
dropout = params['droupout']
lr = params['lr']
vocab_size = tokenizer.vocab_size
max_seq_length = tokenizer.model_max_length
max_batches_per_epoch = None
N_EPOCHS = 20
{'embed_dim': 256, 'num_heads': 4, 'num_layers': 6, 'hidden_dim': 512, 'window_size': 32, 'droupout': 0.1, 'lr': 0.0001}
toy_model = Encoder(vocab_size, embed_dim, num_heads, num_layers, hidden_dim, max_seq_length, window_size, dropout=dropout).to(device)
toy_optimizer = optim.Adam(toy_model.parameters(), lr=lr)
criterion = nn.BCEWithLogitsLoss()
# fit your model
import pickle
if not os.path.exists('toy_transfomer_encoder.pt'):
# overfit
from hw3.training import TransformerEncoderTrainer
toy_trainer = TransformerEncoderTrainer(toy_model, criterion, toy_optimizer)
# set max batches per epoch
_ = toy_trainer.fit(dl_toy, dl_toy, N_EPOCHS, checkpoints='toy_transfomer_encoder', max_batches=max_batches_per_epoch)
toy_saved_state = torch.load('toy_transfomer_encoder.pt')
toy_best_acc = toy_saved_state['best_acc']
toy_model.load_state_dict(toy_saved_state['model_state'])
<All keys matched successfully>
test.assertTrue(toy_best_acc >= 95)
Training on all data¶
Congratulations! You are now ready to train your sentiment analysis classifier!
max_batches_per_epoch = 500
N_EPOCHS = 4
model = Encoder(vocab_size, embed_dim, num_heads, num_layers, hidden_dim, max_seq_length, window_size, dropout).to(device)
optimizer = optim.Adam(model.parameters(), lr=lr)
# fit your model
import pickle
if not os.path.exists('trained_transfomer_encoder.pt'):
from hw3.training import TransformerEncoderTrainer
trainer = TransformerEncoderTrainer(model, criterion, optimizer)
# set max batches per epoch
_ = trainer.fit(dl_train, dl_val, N_EPOCHS, checkpoints='trained_transfomer_encoder', max_batches=max_batches_per_epoch)
saved_state = torch.load('trained_transfomer_encoder.pt')
best_acc = saved_state['best_acc']
model.load_state_dict(saved_state['model_state'])
print("best_acc: ", best_acc)
best_acc: 68.4
test.assertTrue(best_acc >= 65)
Run the following cells to see an example of the model output:
rand_index = torch.randint(len(dataset_tokenized['val']), (1,))
rand_index
tensor([21266])
sample = dataset['val'][rand_index]
sample['text']
['Alfred Hitchcock\'s "Saboteur" (1942) (not to be confused with another Hitchcock film, "Sabotage" made in 1936 which has a completely different plot) is not the Master of Suspense greatest film. It dose not have the depth of "Vertigo" (1958) nor the brooding atmosphere of "Rebbeca" (1940) and it certainly dose not have the emotional impact and acting talent that can be found in "Notorious" (1946), but what it dose have is thrills, adventure and a nail biting climax. The two leads, admittedly, are quite weak. It is easy to understand why Hitchcock wanted Gary Cooper for the Robert Cummings role and Barbara Stanwyck for the role that was taken by Priscilla Lane. Also, the patriotic speeches that Cuumming says (that were written by Dorothy Parker) have dated badly, and the encounter with circus troupe is poorly done (the beard on the bearded lady is clearly false). However, the last half hour is edge-of-your-seat viewing, the climax atop the Stature of Liberty is very well done, and the film is a clear predecessor of "North By Northwest" (1959). The two villains, Otto Kruger and Norman Lloyd are very good and the beginning fire at the Aircraft Factory is a superb sequence. This not the best film Hitch made, but it is surly one of his most entertaining.']
tokenized_sample = dataset_tokenized['val'][rand_index]
tokenized_sample
input_ids = tokenized_sample['input_ids'].to(device)
label = tokenized_sample['label'].to(device)
attention_mask = tokenized_sample['attention_mask'].to(float).to(device)
print('label', label.shape)
print('attention_mask', attention_mask.shape)
prediction = model.predict(input_ids, attention_mask).squeeze(0)
print('label: {}, prediction: {}'.format(label, prediction))
label torch.Size([1]) attention_mask torch.Size([1, 512])
label: tensor([1], device='cuda:0'), prediction: tensor([0.], device='cuda:0', grad_fn=<SqueezeBackward1>)
In the next part you will see how to fine-tune a pretrained model for the same task.
from cs236781.answers import display_answer
import hw3.answers
Questions¶
Fill your answers in hw3.answers.part3_q1 and hw3.answers.part3_q2
Question 1¶
Explain why stacking encoder layers that use the sliding-window attention results in a broader context in the final layer. Hint: Think what happens when stacking CNN layers.
display_answer(hw3.answers.part4_q1)
Your answer: When stacking encoder layers that use sliding-window attention, the final layer has a broader context because the receptive field expands with each layer, exactly as with stacked CNN layers. For example, suppose each token attends to itself and one token on each side (a 3-token window). In the first layer each token sees 3 tokens; in the second layer, the representation of each neighbor already incorporates information from its own neighbors, so a token indirectly covers a window of up to 5 tokens. The context keeps growing this way with depth, so the final layer has the widest context.
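The growth described above can be sketched numerically. This is an illustrative calculation (not part of the assignment code), assuming each layer lets a token attend to `half_window` tokens on each side:

```python
def receptive_field(num_layers, half_window):
    """Tokens covered after stacking `num_layers` sliding-window
    attention layers, each attending `half_window` tokens per side."""
    # Each layer extends the reach by `half_window` on each side,
    # just like stacked CNN layers with kernel size 2*half_window + 1.
    return 1 + 2 * half_window * num_layers

# One layer with one neighbor per side covers 3 tokens...
print(receptive_field(1, 1))  # 3
# ...and two layers cover 5, matching the example above.
print(receptive_field(2, 1))  # 5
```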
Question 2¶
Propose a variation of the attention pattern such that the computational complexity stays similar to that of the sliding-window attention O(nw), but the attention is computed on a more global context. Note: There is no single correct answer to this, feel free to read the paper that proposed the sliding-window. Any solution that makes sense will be considered correct.
display_answer(hw3.answers.part4_q2)
Your answer: We suggest a slight modification of the sliding-window attention mechanism that makes the context more global: we still define a window size, but instead of attending to a contiguous window, we also define a stride parameter that determines the number of tokens to skip between attended positions. The number of tokens each query attends to is still bounded by the window size (hence the complexity remains O(nw)), but the stride lets the model attend to tokens that are further apart, increasing the global context of the attention mechanism. This is essentially the dilated sliding-window pattern. Since the stride is constant, stacking layers still covers the entire sequence.
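As an illustration (not required by the assignment), the strided pattern can be expressed as an attention mask. Here a token attends to at most `window` positions on each side, spaced `stride` tokens apart, keeping the O(nw) cost:

```python
import torch

def strided_window_mask(seq_len, window, stride):
    """Boolean mask where True marks allowed attention pairs: each query
    attends to up to `window` keys on each side, spaced `stride` tokens
    apart (stride=1 recovers the plain sliding window)."""
    idx = torch.arange(seq_len)
    dist = (idx[:, None] - idx[None, :]).abs()
    # keep positions within reach AND aligned to the stride
    return (dist <= window * stride) & (dist % stride == 0)

mask = strided_window_mask(seq_len=8, window=1, stride=2)
# each row has at most 2*window + 1 = 3 allowed positions
print(mask.sum(dim=1).max().item())  # 3
```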
import numpy as np
import pandas as pd
import torch
import unittest
import os
import sys
import pathlib
import urllib
import shutil
import re
import numpy as np
import torch
import matplotlib.pyplot as plt
import pickle
%load_ext autoreload
%autoreload 2
from torch.utils.data import DataLoader, Dataset
import numpy as np
from datasets import DatasetDict
from datasets import load_dataset, concatenate_datasets
from hw3 import training
from cs236781.plot import plot_fit
from cs236781.train_results import FitResult
$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} $$
Part 4: Fine-Tuning a pretrained language model¶
In this part, we will fine-tune BERT for sentiment analysis on the IMDB movie reviews dataset from the previous section.
BERT is a large language model released by Google researchers in 2018. It offers a good balance between popularity and model size, and can be fine-tuned on a single GPU.
If you aren't yet familiar, you can check it out here:
https://arxiv.org/pdf/1810.04805.pdf.
(Read Section 3 for details on the model architecture and fine-tuning on downstream tasks).
In particular, we will use the distilled (smaller) version of BERT, called Distil-BERT. Distil-BERT is widely used in production since it has 40% fewer parameters than BERT, while running 60% faster and retaining 95% of the performance in many benchmarks. It is recommended to glance through the Distil-BERT paper to get a feel for the model architecture and how it differs from BERT: https://arxiv.org/pdf/1910.01108.pdf
We will download a pre-trained Distil-BERT from Hugging Face, so there is no need to train it from scratch.
One of the key strengths of Hugging Face is its extensive collection of pre-trained models. These models are trained on large-scale datasets and exhibit impressive performance on various NLP tasks, such as text classification, named entity recognition, sentiment analysis, machine translation, and question answering, among others. The pre-trained models provided by Hugging Face can be easily fine-tuned for specific downstream tasks, saving significant time and computational resources.
Loading the Dataset¶
We will now load and prepare the IMDB dataset as we did in the previous part.
Here we will load the full training and test set.
dataset = load_dataset('imdb', split=['train', 'test[12260:12740]'])
print(dataset)
[Dataset({
features: ['text', 'label'],
num_rows: 25000
}), Dataset({
features: ['text', 'label'],
num_rows: 480
})]
#wrap it in a DatasetDict to enable methods such as map and format
dataset = DatasetDict({'train': dataset[0], 'test': dataset[1]})
dataset
DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 25000
})
test: Dataset({
features: ['text', 'label'],
num_rows: 480
})
})
We can now access the datasets in the Dict as we would a dictionary. Let's print a few training samples
for i in range(4):
print(f'TRAINING SAMPLE {i}:')
print(dataset['train'][i]['text'])
label = dataset['train'][i]['label']
print(f'Label {i}: {label}')
print('\n')
TRAINING SAMPLE 0: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.<br /><br />I do commend the filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film doesn't have much of a plot. Label 0: 0 TRAINING SAMPLE 1: "I Am Curious: Yellow" is a risible and pretentious steaming pile. It doesn't matter what one's political views are because this film can hardly be taken seriously on any level. As for the claim that frontal male nudity is an automatic NC-17, that isn't true. I've seen R-rated films with male nudity. 
Granted, they only offer some fleeting views, but where are the R-rated films with gaping vulvas and flapping labia? Nowhere, because they don't exist. The same goes for those crappy cable shows: schlongs swinging in the breeze but not a clitoris in sight. And those pretentious indie movies like The Brown Bunny, in which we're treated to the site of Vincent Gallo's throbbing johnson, but not a trace of pink visible on Chloe Sevigny. Before crying (or implying) "double-standard" in matters of nudity, the mentally obtuse should take into account one unavoidably obvious anatomical difference between men and women: there are no genitals on display when actresses appears nude, and the same cannot be said for a man. In fact, you generally won't see female genitals in an American film in anything short of porn or explicit erotica. This alleged double-standard is less a double standard than an admittedly depressing ability to come to terms culturally with the insides of women's bodies. Label 1: 0 TRAINING SAMPLE 2: If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story.<br /><br />One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives (unless one comes up with one while one's mind wanders, as it will invariably do during this pointless film).<br /><br />One might better spend one's time staring out a window at a tree growing.<br /><br /> Label 2: 0 TRAINING SAMPLE 3: This film was probably inspired by Godard's Masculin, féminin and I urge you to see that film instead.<br /><br />The film has two strong elements and those are, (1) the realistic acting (2) the impressive, undeservedly good, photo. Apart from that, what strikes me most is the endless stream of silliness. Lena Nyman has to be most annoying actress in the world. 
She acts so stupid and with all the nudity in this film,...it's unattractive. Comparing to Godard's film, intellectuality has been replaced with stupidity. Without going too far on this subject, I would say that follows from the difference in ideals between the French and the Swedish society.<br /><br />A movie of its time, and place. 2/10. Label 3: 0
We should also check the label distribution:
def label_cnt(type):
ds = dataset[type]
size = len(ds)
cnt = 0
for smp in ds:
cnt += smp['label']
print(f'negative samples in {type} dataset: {size - cnt}')
print(f'positive samples in {type} dataset: {cnt}')
label_cnt('train')
label_cnt('test')
negative samples in train dataset: 12500 positive samples in train dataset: 12500 negative samples in test dataset: 240 positive samples in test dataset: 240
Import the tokenizer for the dataset¶
We will now tokenize the text the same way we did in the previous part.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print("Tokenizer input max length:", tokenizer.model_max_length)
print("Tokenizer vocabulary size:", tokenizer.vocab_size)
Tokenizer input max length: 512 Tokenizer vocabulary size: 30522
def tokenize_text(batch):
return tokenizer(batch["text"], truncation=True, padding=True)
def tokenize_dataset(dataset):
dataset_tokenized = dataset.map(tokenize_text, batched=True, batch_size=None)
return dataset_tokenized
dataset_tokenized = tokenize_dataset(dataset)
# we would like to work with pytorch so we can manually fine-tune
dataset_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])
# no need to parallelize in this assignment
os.environ["TOKENIZERS_PARALLELISM"] = "false"
Setting up the dataloaders and dataset¶
We will now set up the dataloaders for efficient batching and loading of the data.
By now, you are familiar with the class methods needed to create a working DataLoader.
class IMDBDataset(Dataset):
def __init__(self, dataset):
self.ds = dataset
def __getitem__(self, index):
return self.ds[index]
def __len__(self):
return self.ds.num_rows
train_dataset = IMDBDataset(dataset_tokenized['train'])
test_dataset = IMDBDataset(dataset_tokenized['test'])
n_workers= 0
dl_train,dl_test = [
DataLoader(
dataset=train_dataset,
batch_size=12,
shuffle=True,
num_workers=n_workers
),
DataLoader(
dataset=test_dataset,
batch_size=12,
shuffle=False,
num_workers=n_workers
)]
dl_train
<torch.utils.data.dataloader.DataLoader at 0x7fe1c0799ee0>
Importing the model from Hugging Face¶
We will now delve into the process of loading the DistilBERT model from Hugging Face. DistilBERT is a distilled version of the BERT model, offering a lighter and faster alternative while retaining considerable performance on various NLP tasks.
Please refer to the introduction to check out the relevant papers.
For more info on how to use this model, feel free to check it out on the site:
https://huggingface.co/distilbert-base-uncased
To begin, we will import the necessary library required for our implementation.
It is fine if you receive a warning from Hugging Face telling you to train the model on a downstream task; that is exactly what we will do on our IMDB dataset.
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
"distilbert-base-uncased", num_labels=2)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Let's print the model architecture to see what we are dealing with:
model
DistilBertForSequenceClassification(
(distilbert): DistilBertModel(
(embeddings): Embeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(transformer): Transformer(
(layer): ModuleList(
(0-5): 6 x TransformerBlock(
(attention): DistilBertSdpaAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): Linear(in_features=768, out_features=768, bias=True)
(k_lin): Linear(in_features=768, out_features=768, bias=True)
(v_lin): Linear(in_features=768, out_features=768, bias=True)
(out_lin): Linear(in_features=768, out_features=768, bias=True)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): Linear(in_features=768, out_features=3072, bias=True)
(lin2): Linear(in_features=3072, out_features=768, bias=True)
(activation): GELUActivation()
)
(output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
)
)
)
(pre_classifier): Linear(in_features=768, out_features=768, bias=True)
(classifier): Linear(in_features=768, out_features=2, bias=True)
(dropout): Dropout(p=0.2, inplace=False)
)
Fine Tuning¶
We will now move on to the process of fine-tuning the model that we previously loaded from Hugging Face. Fine-tuning allows us to adapt the pre-trained model to our specific NLP task by further training it on task-specific data. This process enhances the model's performance and enables it to make more accurate predictions on our target task.
There are generally two approaches to fine-tuning the loaded model, each with its own advantages and considerations:
1. Freeze all the weights besides the last two linear layers and train only those layers: This approach is commonly referred to as "transfer learning" or "feature extraction." By freezing the weights of the majority of the model's layers, we retain the pre-trained knowledge captured by the model, allowing it to extract useful features from our data. We then replace and train the final few layers, typically linear layers, to adapt the model to our specific task. This method is beneficial when we have limited labeled data or when the pre-trained model has been trained on a similar domain.
2. Retrain all the parameters in the model: This approach involves unfreezing and training all the parameters of the loaded model, including the pre-trained layers. By retraining all the parameters, we allow the model to adjust its representations and update its knowledge based on our specific task and data. This method is often preferred when we have sufficient labeled data available and want the model to learn task-specific features from scratch, or when the pre-trained model's knowledge may not be directly applicable to our domain.
Fine-tuning method 1¶
Freeze all the weights besides the last two linear layers and train only those layers
# TODO:
# Freeze all parameters except for the last 2 linear layers
# ====== YOUR CODE: ======
# start by freezing all layers (freezing everything and then selectively
# unfreezing is more robust than the reverse)
for param in model.parameters():
    param.requires_grad = False
# unfreeze the last two linear layers (only they have one of these names;
# see the printed model architecture)
for name, param in model.named_parameters():
    if "pre_classifier" in name or "classifier" in name:
        param.requires_grad = True
# ========================
# HINT: use the printed model architecture to get the layer names
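A quick sanity check for any freezing scheme is to count trainable parameters before and after. Here is a hedged sketch on a toy two-layer stand-in (in the notebook itself the real model would be the loaded DistilBERT; the sizes below are arbitrary):

```python
import torch.nn as nn

def count_trainable(model):
    """Number of parameters that will receive gradients."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# toy stand-in: a "backbone" layer plus a small classification head
toy = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))
total = count_trainable(toy)  # 8*8 + 8 + 8*2 + 2 = 90

# freeze everything, then unfreeze only the head (same pattern as above)
for p in toy.parameters():
    p.requires_grad = False
for p in toy[-1].parameters():
    p.requires_grad = True

print(count_trainable(toy), "of", total)  # 18 of 90
```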
Training¶
We can use our abstract Trainer class to fine-tune the model:
We will not play around with hyperparameters in this section, as the point is to learn to fine-tune a model.
In addition, we do not need to pass our own loss function for this loaded model (try to understand why).
TODO: Implement the FineTuningTrainer in hw3/training.py
We will train the model for 2 epochs of 40 batches.
You can run this either locally or on the course servers, whichever is most comfortable for you.
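The reason no loss function is needed: Hugging Face sequence-classification models compute a cross-entropy loss internally whenever `labels` are passed to `forward`, and return it in the output. A minimal sketch of that internal behavior (pure PyTorch, not the actual Hugging Face code; the dict output is a simplification of their output object):

```python
import torch
import torch.nn.functional as F

def hf_style_forward(logits, labels=None):
    """Mimics model(input_ids, ..., labels=labels): when labels are
    given, the loss is computed inside the model and returned."""
    loss = None
    if labels is not None:
        loss = F.cross_entropy(logits, labels)
    return {"loss": loss, "logits": logits}

logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])
labels = torch.tensor([0, 1])
out = hf_style_forward(logits, labels)
print(out["loss"] is not None)  # True: the trainer can use it directly
```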
from hw3 import training
optimizer = torch.optim.Adam(model.parameters(), lr = 5e-5)
# fit your model
if not os.path.exists('finetuned_last_2.pt'):
trainer = training.FineTuningTrainer(model, loss_fn = None, optimizer = optimizer)
fit_result = trainer.fit(dl_train,dl_test, checkpoints='finetuned_last_2', num_epochs=2, max_batches= 40)
with open('fit_result_finetune_2.pkl', 'wb') as f:
pickle.dump(fit_result, f)
saved_state = torch.load('finetuned_last_2.pt')
model.load_state_dict(saved_state['model_state'])
best_acc = saved_state['best_acc']
print('best acc:', best_acc)
with open('fit_result_finetune_2.pkl', 'rb') as f:
fit_result = pickle.load(f)
best acc: 75.20833333333333
plot_fit(fit_result)
(<Figure size 1600x1000 with 4 Axes>,
array([<Axes: title={'center': 'train_loss'}, xlabel='Iteration #', ylabel='Loss'>,
<Axes: title={'center': 'train_acc'}, xlabel='Epoch #', ylabel='Accuracy (%)'>,
<Axes: title={'center': 'test_loss'}, xlabel='Iteration #', ylabel='Loss'>,
<Axes: title={'center': 'test_acc'}, xlabel='Epoch #', ylabel='Accuracy (%)'>],
dtype=object))
Fine-tuning method 2¶
Retraining all the parameters in the model
We will reload the model to ensure that the parameters are untouched and we are starting from scratch
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
"distilbert-base-uncased", num_labels=2)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
model
DistilBertForSequenceClassification(
(distilbert): DistilBertModel(
(embeddings): Embeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(transformer): Transformer(
(layer): ModuleList(
(0-5): 6 x TransformerBlock(
(attention): DistilBertSdpaAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): Linear(in_features=768, out_features=768, bias=True)
(k_lin): Linear(in_features=768, out_features=768, bias=True)
(v_lin): Linear(in_features=768, out_features=768, bias=True)
(out_lin): Linear(in_features=768, out_features=768, bias=True)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): Linear(in_features=768, out_features=3072, bias=True)
(lin2): Linear(in_features=3072, out_features=768, bias=True)
(activation): GELUActivation()
)
(output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
)
)
)
(pre_classifier): Linear(in_features=768, out_features=768, bias=True)
(classifier): Linear(in_features=768, out_features=2, bias=True)
(dropout): Dropout(p=0.2, inplace=False)
)
# TODO: Make sure all the model parameters are unfrozen
# ====== YOUR CODE: ======
for param in model.parameters():
    param.requires_grad = True  # fine-tune every layer
# ========================
optimizer = torch.optim.Adam(model.parameters(), lr = 5e-5)
# fit your model
if not os.path.exists('finetuned_all.pt'):
trainer = training.FineTuningTrainer(model, loss_fn = None, optimizer = optimizer)
fit_result = trainer.fit(dl_train,dl_test, checkpoints='finetuned_all', num_epochs=2, max_batches= 40)
with open('finetuned_all.pkl', 'wb') as f:
pickle.dump(fit_result, f)
saved_state = torch.load('finetuned_all.pt')
model.load_state_dict(saved_state['model_state'])
with open('finetuned_all.pkl', 'rb') as f:
fit_result = pickle.load(f)
plot_fit(fit_result)
(<Figure size 1600x1000 with 4 Axes>,
array([<Axes: title={'center': 'train_loss'}, xlabel='Iteration #', ylabel='Loss'>,
<Axes: title={'center': 'train_acc'}, xlabel='Epoch #', ylabel='Accuracy (%)'>,
<Axes: title={'center': 'test_loss'}, xlabel='Iteration #', ylabel='Loss'>,
<Axes: title={'center': 'test_acc'}, xlabel='Epoch #', ylabel='Accuracy (%)'>],
dtype=object))
Questions¶
Fill out your answers in hw3.answers.part4_q1 and hw3.answers.part4_q2
from cs236781.answers import display_answer
import hw3.answers
Question 1¶
Explain the results that you got here in comparison to the results achieved by the encoder trained from scratch in the previous part.
If one of the models performed better, why was this so?
Will this always be the case on any downstream task, or is this phenomenon specific to this task?
display_answer(hw3.answers.part5_q1)
Your answer: Both our fine-tuned models performed better than the model trained from scratch in the previous part (both fine-tuned models reached over 80% accuracy, while the previous model reached a little below 70%). This is likely because fine-tuning lets the model adapt its pre-trained knowledge to the task, optimizing both general and task-specific representations, while training from scratch requires learning everything from the ground up, including basic language understanding, which is less efficient. This won't necessarily be the case on every downstream task: for example, if the downstream task's domain differs significantly from the pre-training data, fine-tuning may not be as effective.
Question 2¶
Assume that when fine-tuning, instead of freezing the internal model layers and leaving the last 2 layers unfrozen, we instead froze the last layers and fine-tuned internal layers such as the multi-headed attention blocks.
Would the model still be able to successfully fine-tune to this task?
Or would the results be worse?
Explain.
display_answer(hw3.answers.part5_q2)
Your answer: Freezing the last layers and fine-tuning only the internal ones will likely give worse results than freezing the internal layers and fine-tuning the last ones. This is because the internal layers are responsible for general feature extraction, like capturing syntax and semantics, while the last layers are closer to the output space and are designed to produce task-specific representations; freezing them significantly limits the model's ability to adapt to the new task, which is the basic idea behind fine-tuning.
Question 3¶
If you want to conduct a machine translation task, as seen in the tutorials, can you use BERT?
Describe the modifications you need to make, i.e. if the source tokens are $x_t$ and the targets are $y_t$, how would the model work to produce the translation?
If the model can't handle this task, describe the architecture changes and why you need them. If a change in the pre-training is required, describe it as well.
display_answer(hw3.answers.part5_q3)
Your answer: To use BERT for machine translation, we'd need to change both the model architecture and the pre-training. First, BERT is a bidirectional encoder-only model, but machine translation requires an encoder-decoder architecture: the encoder processes the input tokens from the source language ($x_t$), and the decoder autoregressively generates the output tokens ($y_t$), using the previously generated tokens as input.
A change in the pre-training (or fine-tuning) is also required, since we'd need to train the model on parallel text in the two languages in order to teach it how to translate between them.
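To make the encoder-decoder flow concrete, here is a toy greedy-decoding sketch with randomly initialized weights. It is purely illustrative: the sizes, the start token, and the use of `nn.Transformer` are our assumptions, not a trained translator.

```python
import torch
import torch.nn as nn

vocab, d_model = 10, 16
emb = nn.Embedding(vocab, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=2,
                             num_encoder_layers=1, num_decoder_layers=1,
                             dim_feedforward=32)
to_vocab = nn.Linear(d_model, vocab)

src = torch.randint(vocab, (5, 1))        # source tokens x_t, (seq, batch)
ys = torch.zeros(1, 1, dtype=torch.long)  # start token for the target y_t

with torch.no_grad():
    for _ in range(4):  # autoregressive greedy decoding
        out = transformer(emb(src), emb(ys))           # (tgt_len, 1, d_model)
        next_tok = to_vocab(out[-1]).argmax(-1, keepdim=True)
        ys = torch.cat([ys, next_tok.T], dim=0)        # feed prediction back in

print(ys.shape)  # torch.Size([5, 1]): start token + 4 generated tokens
```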
Question 4¶
We saw in the course two types of sequential models: RNNs and Transformers.
What could be the main reason to choose an RNN over a Transformer? Note that both can be lightweight or heavy in computation.
display_answer(hw3.answers.part5_q4)
Your answer: RNNs are a good choice for tasks with strong temporal dependencies, where the current output heavily relies on previous inputs, such as speech recognition or real-time sensor data. They process sequences step-by-step in order, making them a good fit for data with a clear progression over time. Additionally, RNNs are more memory-efficient than Transformers when working with very long sequences, as they don’t require computing attention over all input tokens at once. This makes them a better choice for limited resources or variable-length data.
Question 5¶
We have learned that BERT uses "Next Sentence Prediction" (NSP) as part of the pre-training tasks.
Describe what it is (where does the prediction occur, what is the loss).
Do you think this is a crucial part of pre-training? Try to analyze your answer, i.e. what essential capability it gives the model, or why it implicitly doesn't contribute much.
display_answer(hw3.answers.part5_q5)
Your answer: NSP is a pre-training task in BERT where the model is trained to predict whether two sentences are consecutive or sampled independently from the dataset. The prediction occurs at the [CLS] token, a special token added at the start of the input sequence, which serves as a summary representation of the entire input pair for classification tasks. The loss is a binary cross-entropy loss, where the labels indicate whether the two sentences are consecutive or not. While NSP can help BERT understand sentence relationships, which is useful for tasks like sentence prediction and question answering, it may not be crucial, because masked language modeling already captures much of this implicitly. Its importance depends on how much sentence-level reasoning the downstream tasks require.
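A minimal sketch of the NSP head described above (hidden size 768 as in BERT; the random tensors below stand in for real encoder outputs, and the two-class cross-entropy is equivalent to the binary loss mentioned):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden = 768
nsp_head = nn.Linear(hidden, 2)  # IsNext / NotNext classifier

# stand-in encoder outputs for 4 sentence pairs; position 0 is [CLS]
hidden_states = torch.randn(4, 128, hidden)
cls_vec = hidden_states[:, 0]          # the prediction occurs at [CLS]
logits = nsp_head(cls_vec)             # (4, 2)
labels = torch.tensor([1, 0, 1, 0])    # 1 = consecutive, 0 = random pair
loss = F.cross_entropy(logits, labels)
print(logits.shape, loss.item() > 0)
```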